[R] More efficient option to append()?
Paul Hiemstra
paul.hiemstra at knmi.nl
Fri Aug 19 15:56:10 CEST 2011
On 08/18/2011 07:46 AM, Timothy Bates wrote:
> This takes a few seconds to do 1 million lines, and remains in explicit for-loop form
>
> numberofSalaryBands = 1000000 # 2000000
> x = sample(1:15,numberofSalaryBands, replace=T)
> y = sample((1:10)*1000, numberofSalaryBands, replace=T)
> df = data.frame(x,y)
> finalN = sum(df$x)
> myVar = rep(NA, finalN)
> outIndex = 1
> for (i in 1:numberofSalaryBands) {
> kount = df$x[i]
> myVar[outIndex:(outIndex+kount-1)] = rep(df$y[i], kount) # Make x[i] copies of value y[i]
For posterity: the problem in the OP's code was that myVar kept growing.
Every time a vector grows, R has to allocate a larger block of memory and
copy the old contents into it, which makes the loop quadratic and
therefore very slow. In this example the space needed for myVar is
preallocated by creating an object of the appropriate length before the
for loop.
So, in my opinion, growing objects with append() in a for loop should be
avoided like the plague! A fully vectorized alternative is sketched below
the quoted code.
my 2cts :)
Paul
> outIndex = outIndex+kount
> }
> head(myVar)
> plyr::count(myVar)
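For completeness: the whole expansion can also be done without any loop,
because rep() accepts a 'times' vector of the same length as its first
argument. A minimal sketch, reusing the df built above (myVar2 is just a
name I introduce here for the comparison):

myVar2 <- rep(df$y, times = df$x)  # repeat each df$y[i] exactly df$x[i] times
identical(myVar, myVar2)           # should be TRUE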
>
>
> On Aug 18, 2011, at 12:17 AM, Alex Ruiz Euler wrote:
>
>>
>> Dear R community,
>>
>> I have a 2 million by 2 matrix that looks like this:
>>
>> x<-sample(1:15,2000000, replace=T)
>> y<-sample(1:10*1000, 2000000, replace=T)
>>          x     y
>> [1,]    10  4000
>> [2,]     3  1000
>> [3,]     3  4000
>> [4,]     8  6000
>> [5,]     2  9000
>> [6,]     3  8000
>> [7,]     2 10000
>> (...)
>>
>>
>> The first column is a population expansion factor for the number in the
>> second column (household income). I want to expand the second column
>> with the first so that I end up with a vector beginning with 10
>> observations of 4000, then 3 observations of 1000 and so on. In my mind
>> the natural approach would be to create a NULL vector and append the
>> expansions:
>>
>> myvar<-NULL
>> myvar<-append(myvar, replicate(x[1],y[1]), 1)
>>
>> for (i in 2:length(x)) {
>> myvar<-append(myvar,replicate(x[i],y[i]),sum(x[1:i])+1)
>> }
>>
>> to end with a vector of sum(x), which in my real database corresponds
>> to 22 million observations.
>>
>> This works fine, but only if I run it for, say, the first 1000
>> observations. If I try to perform it on all 2 million observations it
>> takes far too long to be useful (I left it running for 11 hours
>> yesterday to no avail).
>>
>>
>> I know R performs well with operations on relatively large vectors. Why
>> is this so inefficient? And what would be the smart way to do this?
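To make the difference concrete, here is a minimal timing sketch (n is
kept small here so the append() version finishes): every append() call
copies the entire vector built so far, so the loop does on the order of
n^2 element copies, whereas rep() allocates the result once.

n <- 10000
x <- sample(1:15, n, replace = TRUE)
y <- sample((1:10) * 1000, n, replace = TRUE)

# growing the result: each iteration reallocates and copies myvar
system.time({
  myvar <- NULL
  for (i in 1:n) myvar <- append(myvar, rep(y[i], x[i]))
})

# vectorized: one allocation, no per-iteration copying
system.time(myvar2 <- rep(y, times = x))

identical(myvar, myvar2)  # should be TRUE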
>>
>> Thanks in advance.
>> Alex
>>
--
Paul Hiemstra, Ph.D.
Global Climate Division
Royal Netherlands Meteorological Institute (KNMI)
Wilhelminalaan 10 | 3732 GK | De Bilt | Kamer B 3.39
P.O. Box 201 | 3730 AE | De Bilt
tel: +31 30 2206 494
http://intamap.geo.uu.nl/~paul
http://nl.linkedin.com/pub/paul-hiemstra/20/30b/770