[R] More efficient option to append()?
Alex Ruiz Euler
rruizeuler at ucsd.edu
Fri Aug 19 21:08:11 CEST 2011
Thanks for the code corrections. I see how for loops, append() and
naively growing a NULL vector can be so resource-consuming. I tried
the code with 20 million observations on the following machine:
processor : 7
cpu family : 6
model name : Intel(R) Core(TM) i7 CPU Q 720 @ 1.60GHz
cpu MHz : 933.000
cache size : 6144 KB
First I tried Timothy's code; after leaving it running for half an
hour I had to interrupt the command at
Timing stopped at: 1033.516 829.147 1845.648
Then Dennis' option:
user system elapsed
25.793 0.224 25.784
And for Paul's option, using a vector of length 20 million, I had to
stop at:
Timing stopped at: 850.577 8.868 851.464
Not very efficient for relatively large vectors. I have also read
that wrapping an expression in {} instead of (), for example {x+1},
runs faster, as does working directly with matrices instead of data
frames.
Thanks for your input.
Alex
On Fri, 19 Aug 2011 13:58:09 +0000
Paul Hiemstra <paul.hiemstra at knmi.nl> wrote:
> As I already stated in my reply to your earlier post:
>
> resending the answer for the archives of the mailing list...
>
> Hi Alex,
>
> The other reply already gave you the R way of doing this while avoiding
> the for loop. However, there is a more general reason why your for loop
> is terribly inefficient. A small set of examples:
>
> largeVector = runif(10e4)
> outputVector = NULL
> system.time(for(i in 1:length(largeVector)) {
> outputVector = append(outputVector, largeVector[i] + 1)
> })
> # user system elapsed
> # 6.591 0.168 6.786
>
> The problem in this code is that outputVector keeps on growing. Each
> call to append() allocates a new, longer vector and copies every
> existing element into it, so the total cost of the loop grows
> quadratically with the length of largeVector. Several (much) faster
> alternatives exist:
>
> # Pre-allocating the outputVector
> outputVector = rep(0,length(largeVector))
> system.time(for(i in 1:length(largeVector)) {
> outputVector[i] = largeVector[i] + 1
> })
> # user system elapsed
> # 0.178 0.000 0.178
> # speed up of 37 times, this will only increase for large
> # lengths of largeVector
>
> # Using apply functions
> system.time(outputVector <- sapply(largeVector, function(x) return(x + 1)))
> # user system elapsed
> # 0.124 0.000 0.125
> # Even a bit faster
>
> # Using vectorisation
> system.time(outputVector <- largeVector + 1)
> # user system elapsed
> # 0.000 0.000 0.001
> # Practically instant, 6780 times faster than the first example
>
> It is not always clear which method is most suitable and which performs
> best. At least they all perform much, much better than the naive option
> of letting outputVector grow.
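>
> For the original expansion problem (repeat each y[i] exactly x[i]
> times), the fully vectorised tool is rep() with a vector "times"
> argument; a minimal sketch using the names from the original post:
>
> x <- sample(1:15, 2000000, replace = TRUE)
> y <- sample((1:10)*1000, 2000000, replace = TRUE)
> # repeat y[i] exactly x[i] times; the result has length sum(x)
> myvar <- rep(y, times = x)
> stopifnot(length(myvar) == sum(x))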
>
> cheers,
> Paul
>
>
>
> On 08/17/2011 11:17 PM, Alex Ruiz Euler wrote:
> >
> > Dear R community,
> >
> > I have a 2 million by 2 matrix that looks like this:
> >
> > x<-sample(1:15,2000000, replace=T)
> > y<-sample(1:10*1000, 2000000, replace=T)
> > x y
> > [1,] 10 4000
> > [2,] 3 1000
> > [3,] 3 4000
> > [4,] 8 6000
> > [5,] 2 9000
> > [6,] 3 8000
> > [7,] 2 10000
> > (...)
> >
> >
> > The first column is a population expansion factor for the number in the
> > second column (household income). I want to expand the second column
> > with the first so that I end up with a vector beginning with 10
> > observations of 4000, then 3 observations of 1000 and so on. In my mind
> > the natural approach would be to create a NULL vector and append the
> > expansions:
> >
> > myvar<-NULL
> > myvar<-append(myvar, replicate(x[1],y[1]), 1)
> >
> > for (i in 2:length(x)) {
> > myvar<-append(myvar,replicate(x[i],y[i]),sum(x[1:i])+1)
> > }
> >
> > to end with a vector of length sum(x), which in my real database
> > corresponds to 22 million observations.
> >
> > This works fine -- if I only run it for the first, say, 1000
> > observations. If I try to perform it on all 2 million observations
> > it takes far too long to be useful (I left it running for 11 hours
> > yesterday to no avail).
> >
> >
> > I know R performs well with operations on relatively large vectors. Why
> > is this so inefficient? And what would be the smart way to do this?
> >
> > Thanks in advance.
> > Alex
> >
> > ______________________________________________
> > R-help at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
>
>
> --
> Paul Hiemstra, Ph.D.
> Global Climate Division
> Royal Netherlands Meteorological Institute (KNMI)
> Wilhelminalaan 10 | 3732 GK | De Bilt | Kamer B 3.39
> P.O. Box 201 | 3730 AE | De Bilt
> tel: +31 30 2206 494
>
> http://intamap.geo.uu.nl/~paul
> http://nl.linkedin.com/pub/paul-hiemstra/20/30b/770