[R] More efficient option to append()?
Alex Ruiz Euler
rruizeuler at ucsd.edu
Fri Aug 19 21:08:11 CEST 2011
Thanks for the code corrections. I see how for loops, append() and
naively growing a NULL vector can be so resource-consuming. I tried
the code with 20 million observations on the following machine:
processor : 7
cpu family : 6
model name : Intel(R) Core(TM) i7 CPU Q 720 @ 1.60GHz
cpu MHz : 933.000
cache size : 6144 KB
First I tried Timothy's code; after leaving it running for half an
hour I had to interrupt the command at
Timing stopped at: 1033.516 829.147 1845.648
Then Dennis' option:
user system elapsed
25.793 0.224 25.784
And for Paul's option, using a vector of length 20 million, I had to
stop at:
Timing stopped at: 850.577 8.868 851.464
Not very efficient for relatively large vectors. I have also read
that wrapping an expression in {} instead of (), for example {x+1},
runs faster, as does working directly with matrices instead of data
frames.
Thanks for your input.
Alex
On Fri, 19 Aug 2011 13:58:09 +0000
Paul Hiemstra <paul.hiemstra at knmi.nl> wrote:
> As I already stated in my reply to your earlier post:
>
> resending the answer for the archives of the mailing list...
>
> Hi Alex,
>
> The other reply already gave you the R way of doing this while avoiding
> the for loop. However, there is a more general reason why your for loop
> is terribly inefficient. A small set of examples:
>
> largeVector = runif(10e4)
> outputVector = NULL
> system.time(for(i in 1:length(largeVector)) {
> outputVector = append(outputVector, largeVector[i] + 1)
> })
> # user system elapsed
> # 6.591 0.168 6.786
>
> The problem in this code is that outputVector keeps on growing. Each
> call to append() allocates a new, longer vector and copies every
> existing element into it, so the total cost of the loop grows
> quadratically with the length of largeVector. Several (much) faster
> alternatives exist:
>
> # Pre-allocating the outputVector
> outputVector = rep(0,length(largeVector))
> system.time(for(i in 1:length(largeVector)) {
> outputVector[i] = largeVector[i] + 1
> })
> # user system elapsed
> # 0.178 0.000 0.178
> # speed up of 37 times, this will only increase for large
> # lengths of largeVector
>
> # Using apply functions
> system.time(outputVector <- sapply(largeVector, function(x) return(x + 1)))
> # user system elapsed
> # 0.124 0.000 0.125
> # Even a bit faster
>
> # Using vectorisation
> system.time(outputVector <- largeVector + 1)
> # user system elapsed
> # 0.000 0.000 0.001
> # Practically instant, 6780 times faster than the first example
>
> It is not always clear which method is most suitable and which performs
> best. At least they all perform much, much better than the naive option
> of letting outputVector grow.
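>
> For the original expansion problem (repeat each y[i] exactly x[i]
> times), the fully vectorised tool is rep() with a vector "times"
> argument; a minimal sketch using the names from the original post:
>
> x <- sample(1:15, 2000000, replace = TRUE)
> y <- sample((1:10)*1000, 2000000, replace = TRUE)
> # repeat y[i] exactly x[i] times; the result has length sum(x)
> myvar <- rep(y, times = x)
> stopifnot(length(myvar) == sum(x))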
>
> cheers,
> Paul
>
>
>
> On 08/17/2011 11:17 PM, Alex Ruiz Euler wrote:
> >
> > Dear R community,
> >
> > I have a 2 million by 2 matrix that looks like this:
> >
> > x<-sample(1:15,2000000, replace=T)
> > y<-sample(1:10*1000, 2000000, replace=T)
> > x y
> > [1,] 10 4000
> > [2,] 3 1000
> > [3,] 3 4000
> > [4,] 8 6000
> > [5,] 2 9000
> > [6,] 3 8000
> > [7,] 2 10000
> > (...)
> >
> >
> > The first column is a population expansion factor for the number in the
> > second column (household income). I want to expand the second column
> > with the first so that I end up with a vector beginning with 10
> > observations of 4000, then 3 observations of 1000 and so on. In my mind
> > the natural approach would be to create a NULL vector and append the
> > expansions:
> >
> > myvar<-NULL
> > myvar<-append(myvar, replicate(x[1],y[1]), 1)
> >
> > for (i in 2:length(x)) {
> > myvar<-append(myvar,replicate(x[i],y[i]),sum(x[1:i])+1)
> > }
> >
> > to end with a vector of length sum(x), which in my real database
> > corresponds to 22 million observations.
> >
> > This works fine -- if I only run it for the first, say, 1000
> > observations. If I try to perform it on all 2 million observations
> > it takes far too long to be useful (I left it running for 11 hours
> > yesterday to no avail).
> >
> >
> > I know R performs well with operations on relatively large vectors. Why
> > is this so inefficient? And what would be the smart way to do this?
> >
> > Thanks in advance.
> > Alex
> >
> > ______________________________________________
> > R-help at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
>
>
> --
> Paul Hiemstra, Ph.D.
> Global Climate Division
> Royal Netherlands Meteorological Institute (KNMI)
> Wilhelminalaan 10 | 3732 GK | De Bilt | Kamer B 3.39
> P.O. Box 201 | 3730 AE | De Bilt
> tel: +31 30 2206 494
>
> http://intamap.geo.uu.nl/~paul
> http://nl.linkedin.com/pub/paul-hiemstra/20/30b/770