[R] For loop gets exponentially slower as dataset gets larger...
Gabor Grothendieck
ggrothendieck at gmail.com
Tue Jan 3 20:30:34 CET 2006
Accepting this stacked representation for the
moment, try this. When reordering the dates, do it
in reverse order. Then loop over all codes,
applying the zoo function na.locf to the prices
for that code. locf stands for "last observation
carried forward". Since our dates are reversed, it
will carry the next observation backwards.
Finally, sort back into ascending order.
library(zoo) # needed for na.locf, which also works for non-zoo objects
data <- data[order(data$code, -as.numeric(data$date_)), ]  # dates descending
attach(data)
next_price <- price
for (i in unique(code))
    next_price[code == i] <- na.locf(price[code == i], na.rm = FALSE)
data$next_price <- next_price
data <- data[order(data$code, data$date_), ]  # restore ascending order
detach()
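To illustrate the reverse/na.locf trick in isolation, here is a minimal sketch on a made-up toy vector (not from the post): reversing, carrying the last observation forward, and reversing back fills each NA with the *next* observed value, while trailing NAs stay NA.

```r
library(zoo)

p <- c(1, NA, 3, NA, NA)   # toy prices; position 2 is missing
rev(na.locf(rev(p), na.rm = FALSE))
# the NA at position 2 becomes 3 (the next observed price);
# the trailing NAs remain NA because nothing follows them
```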
Here it is again, but this time we represent the
result as a list of zoo objects with one component
per code. In the code below we split the data on
code and apply f to each piece. Note that na.locf
replaces each NA with the last observation carried
forward, so by reversing the data, applying
na.locf, and reversing again we get the desired
effect.
library(zoo)
f <- function(x) {
    z <- zoo(x$price, x$date_)
    next_price <- rev(na.locf(rev(coredata(z)), na.rm = FALSE))
    merge(z, next_price)
}
z <- lapply(split(data, data$code), f)
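To see what this produces, here is a hedged, self-contained sketch run on a small toy data frame (made-up rows in the spirit of the sample below, not the real data). The column names "z" and "next_price" come from the merge() call; folding the result back into the frame assumes, as in the post, that the frame is sorted by code and then date, since split() preserves within-group row order.

```r
library(zoo)

# Toy frame (invented rows): sorted by code, then date
toy <- data.frame(
    code  = c(1635, 1635, 1635, 1881, 1881),
    date_ = as.Date(c("2005-07-01", "2005-07-04", "2005-07-05",
                      "2003-12-01", "2003-12-02")),
    price = c(453.3153, NA, 453.9155, 66.1562, 64.9192)
)

f <- function(x) {
    z <- zoo(x$price, x$date_)
    next_price <- rev(na.locf(rev(coredata(z)), na.rm = FALSE))
    merge(z, next_price)
}
z <- lapply(split(toy, toy$code), f)

z[["1635"]]   # one two-column zoo object per code

# Optionally fold the new column back into the original frame;
# this lines up because split() keeps within-group row order
toy$next_price <- unname(unlist(lapply(z, function(m) coredata(m$next_price))))
```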
On 1/3/06, r user <ruser2006 at yahoo.com> wrote:
> I am running R 2.1.1 in a Microsoft Windows XP environment.
>
> I have a matrix with three vectors ("columns") and ~2 million "rows". The three vectors are date_, id, and price. The data is ordered (sorted) by code and date_.
>
> (The matrix contains daily prices for several thousand stocks, and has ~2 million "rows". If a stock did not trade on a particular date, its price is set to "NA")
>
> I wish to add a fourth vector that is "next_price". ("Next price" is the current price as long as the current price is not "NA". If the current price is NA, the "next_price" is the next price that the security with this same ID trades. If the stock does not trade again, "next_price" is set to NA.)
>
> I wrote the following loop to calculate next_price. It works as intended, but I have one problem. When I have only 10,000 rows of data, the calculations are very fast. However, when I run the loop on the full 2 million rows, it seems to take ~ 1 second per row.
>
> Why is this happening? What can I do to speed the calculations when running the loop on the full 2 million rows?
>
> (I am not running low on memory, but I am maxing out my CPU at 100%)
>
> Here is my code and some sample data:
>
> data<- data[order(data$code,data$date_),]
> l<-dim(data)[1]
> w<-3
> data[l,w+1]<-NA
>
> for (i in (l-1):(1)){
> data[i,w+1]<-ifelse(is.na(data[i,w])==F,data[i,w],ifelse(data[i,2]==data[i+1,2],data[i+1,w+1],NA))
> }
>
>
> date id price next_price
> 6/24/2005 1635 444.7838 444.7838
> 6/27/2005 1635 448.4756 448.4756
> 6/28/2005 1635 455.4161 455.4161
> 6/29/2005 1635 454.6658 454.6658
> 6/30/2005 1635 453.9155 453.9155
> 7/1/2005 1635 453.3153 453.3153
> 7/4/2005 1635 NA 453.9155
> 7/5/2005 1635 453.9155 453.9155
> 7/6/2005 1635 453.0152 453.0152
> 7/7/2005 1635 452.8651 452.8651
> 7/8/2005 1635 456.0163 456.0163
> 12/19/2005 1635 442.6982 442.6982
> 12/20/2005 1635 446.5159 446.5159
> 12/21/2005 1635 452.4714 452.4714
> 12/22/2005 1635 451.074 451.074
> 12/23/2005 1635 454.6453 454.6453
> 12/27/2005 1635 NA NA
> 12/28/2005 1635 NA NA
> 12/1/2003 1881 66.1562 66.1562
> 12/2/2003 1881 64.9192 64.9192
> 12/3/2003 1881 66.0078 66.0078
> 12/4/2003 1881 65.8098 65.8098
> 12/5/2003 1881 64.1275 64.1275
> 12/8/2003 1881 64.8697 64.8697
> 12/9/2003 1881 63.5337 63.5337
> 12/10/2003 1881 62.9399 62.9399
>
>
>
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
>
>