[R] For loop gets exponentially slower as dataset gets larger...
Gabor Grothendieck
ggrothendieck at gmail.com
Tue Jan 3 20:30:34 CET 2006
Accepting this stacked representation for the
moment, try this. When reordering the dates, do it
in reverse order. Then loop over all codes,
applying the zoo function na.locf to the prices
for that code. locf stands for "last observation
carried forward". Since our dates are reversed, it
will carry the next observation backwards.
Finally, sort back into ascending order.
library(zoo) # needed for na.locf, which also works for non-zoo objects
data <- data[order(data$code, -as.numeric(data$date_)), ]  # dates descending
attach(data)
next_price <- price
for (i in unique(code))
    next_price[code == i] <- na.locf(price[code == i], na.rm = FALSE)
data$next_price <- next_price
data <- data[order(data$code, data$date_), ]  # restore ascending order
detach()
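To illustrate the reverse/na.locf trick in isolation, here is a minimal sketch on a made-up toy vector (not from the post): reversing, carrying the last observation forward, and reversing back fills each NA with the *next* observed value, while trailing NAs stay NA.

```r
library(zoo)

p <- c(1, NA, 3, NA, NA)   # toy prices; position 2 is missing
rev(na.locf(rev(p), na.rm = FALSE))
# the NA at position 2 becomes 3 (the next observed price);
# the trailing NAs remain NA because nothing follows them
```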
Here it is again, but this time we represent the
result as a list of zoo objects with one component
per code. In the code below we split the data on
code and apply f to each piece. Note that na.locf
replaces each NA with the last observation carried
forward, so by reversing the data, applying
na.locf, and reversing again we get the desired
effect.
library(zoo)
f <- function(x) {
    z <- zoo(x$price, x$date_)
    next_price <- rev(na.locf(rev(coredata(z)), na.rm = FALSE))
    merge(z, next_price)
}
z <- lapply(split(data, data$code), f)
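To see what this produces, here is a hedged, self-contained sketch run on a small toy data frame (made-up rows in the spirit of the sample below, not the real data). The column names "z" and "next_price" come from the merge() call; folding the result back into the frame assumes, as in the post, that the frame is sorted by code and then date, since split() preserves within-group row order.

```r
library(zoo)

# Toy frame (invented rows): sorted by code, then date
toy <- data.frame(
    code  = c(1635, 1635, 1635, 1881, 1881),
    date_ = as.Date(c("2005-07-01", "2005-07-04", "2005-07-05",
                      "2003-12-01", "2003-12-02")),
    price = c(453.3153, NA, 453.9155, 66.1562, 64.9192)
)

f <- function(x) {
    z <- zoo(x$price, x$date_)
    next_price <- rev(na.locf(rev(coredata(z)), na.rm = FALSE))
    merge(z, next_price)
}
z <- lapply(split(toy, toy$code), f)

z[["1635"]]   # one two-column zoo object per code

# Optionally fold the new column back into the original frame;
# this lines up because split() keeps within-group row order
toy$next_price <- unname(unlist(lapply(z, function(m) coredata(m$next_price))))
```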
On 1/3/06, r user <ruser2006 at yahoo.com> wrote:
> I am running R 2.1.1 in a Microsoft Windows XP environment.
>
> I have a matrix with three vectors ("columns") and ~2 million "rows". The three vectors are date_, id, and price. The data is ordered (sorted) by code and date_.
>
> (The matrix contains daily prices for several thousand stocks, and has ~2 million "rows". If a stock did not trade on a particular date, its price is set to "NA")
>
> I wish to add a fourth vector that is "next_price". ("Next price" is the current price as long as the current price is not "NA". If the current price is NA, the "next_price" is the next price that the security with this same ID trades. If the stock does not trade again, "next_price" is set to NA.)
>
> I wrote the following loop to calculate next_price. It works as intended, but I have one problem. When I have only 10,000 rows of data, the calculations are very fast. However, when I run the loop on the full 2 million rows, it seems to take ~ 1 second per row.
>
> Why is this happening? What can I do to speed the calculations when running the loop on the full 2 million rows?
>
> (I am not running low on memory, but I am maxing out my CPU at 100%)
>
> Here is my code and some sample data:
>
> data<- data[order(data$code,data$date_),]
> l<-dim(data)[1]
> w<-3
> data[l,w+1]<-NA
>
> for (i in (l-1):(1)){
> data[i,w+1]<-ifelse(is.na(data[i,w])==F,data[i,w],ifelse(data[i,2]==data[i+1,2],data[i+1,w+1],NA))
> }
>
>
> date id price next_price
> 6/24/2005 1635 444.7838 444.7838
> 6/27/2005 1635 448.4756 448.4756
> 6/28/2005 1635 455.4161 455.4161
> 6/29/2005 1635 454.6658 454.6658
> 6/30/2005 1635 453.9155 453.9155
> 7/1/2005 1635 453.3153 453.3153
> 7/4/2005 1635 NA 453.9155
> 7/5/2005 1635 453.9155 453.9155
> 7/6/2005 1635 453.0152 453.0152
> 7/7/2005 1635 452.8651 452.8651
> 7/8/2005 1635 456.0163 456.0163
> 12/19/2005 1635 442.6982 442.6982
> 12/20/2005 1635 446.5159 446.5159
> 12/21/2005 1635 452.4714 452.4714
> 12/22/2005 1635 451.074 451.074
> 12/23/2005 1635 454.6453 454.6453
> 12/27/2005 1635 NA NA
> 12/28/2005 1635 NA NA
> 12/1/2003 1881 66.1562 66.1562
> 12/2/2003 1881 64.9192 64.9192
> 12/3/2003 1881 66.0078 66.0078
> 12/4/2003 1881 65.8098 65.8098
> 12/5/2003 1881 64.1275 64.1275
> 12/8/2003 1881 64.8697 64.8697
> 12/9/2003 1881 63.5337 63.5337
> 12/10/2003 1881 62.9399 62.9399
>
>
>
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
>
>