[R] dramatic speed difference in lapply
Tom Short
tshort.rlists at gmail.com
Fri Feb 26 22:21:14 CET 2010
I'm sorry, Rob, but that code is dense enough and formatted badly
enough that it's hard to dig through.
You may want to try the data.table package. The development version on
R-forge is pretty fast for grouping operations like this. I'm not sure
if this is what you're really after. It's hard to tell from your
example.
Compare some speeds:
> dat <- data.frame(D=sample(32000:33000, 666000,T),
+ Fid=sample(1:10,666000,T),
+ A=sample(1:5,666000,T))
>
> ####### one of your examples
> system.time(ret <- fedb.ddplyWrapper2(dat, c("D", "Fid"),
+ function(x) c(sum(x[,"A"], na.rm=T),
sum(x[,"A"], na.rm=T))))
user system elapsed
21.78 14.42 36.35
>
>
> ####### data.table
> install.packages("data.table",repos="http://R-Forge.R-project.org")
> library(data.table)
> dt <- as.data.table(dat)
> system.time(ret2 <- dt[, sum(A, na.rm=T), by = "D,Fid"])
user system elapsed
0.27 0.00 0.28
>
>
> ####### plyr for comparison, too
> library(plyr)
> system.time(ret3 <- ddply(dat, .(D,Fid), function(x) sum(x$A, na.rm=T)))
user system elapsed
28.94 12.16 41.23
> head(ret)
[,1] [,2]
1 175 175
2 222 222
3 221 221
4 134 134
5 253 253
6 194 194
> head(ret2)
D Fid V1
[1,] 32000 1 228
[2,] 32000 2 209
[3,] 32000 3 182
[4,] 32000 4 180
[5,] 32000 5 181
[6,] 32000 6 222
> head(ret3)
D Fid V1
1 32000 1 175
2 32000 2 222
3 32000 3 221
4 32000 4 134
5 32000 5 253
6 32000 6 194
- Tom
On Fri, Feb 26, 2010 at 2:58 PM, Rob Forler <rforler at uchicago.edu> wrote:
> So I have a function that does lapply's for me based on dimension. Currently
> only works for length(pivotColumns)=2 because I haven't fixed the rbinds. I
> have two versions. One runs WAYYY faster than the other. And I'm not sure
> why.
>
> Fast Version:
>
> fedb.ddplyWrapper2Fast <- function(data, pivotColumns, listNameFunctions,
> ...){
> lapplyFunctionRecurse <- function(cdata, level=1, ...){
> if(level==1){
>
> return(lapply(split(seq(nrow(cdata)),cdata[,pivotColumns[level]], drop=T),
> function(x) lapplyFunctionRecurse(x, level+1, ...)))
> } else if (level==length(pivotColumns)) {
> #
> return(lapply(split(cdata,data[cdata,pivotColumns[level]], drop=T),
> function(x, ...) listNameFunctions(data[x,], ...)))
> return(lapply(split(cdata,data[cdata,pivotColumns[level]],
> drop=T), function(x, ...) c(data[cdata[1],pivotColumns[2]],
> data[cdata[1],pivotColumns[1]], sum(data[cdata,"A"], na.rm=T),
> sum(data[cdata,"A"], na.rm=T))))
> } else {
> return(lapply(split(cdata,data[cdata,pivotColumns[level]],
> drop=T), function(x) lapplyFunctionRecurse(x, level+1, ...)))
> }
> }
> result = lapplyFunctionRecurse(data, ...)
> matrix2 <- do.call('rbind', lapply(result, function(x)
> do.call('rbind',x)))
> return(matrix2)
> }
>
>
> dat <- data.frame(D=sample(32000:33000, 666000,
> T),Fid=sample(1:10,666000,T), A=sample(1:5,666000,T))
>> temp = proc.time(); ret = fedb.ddplyWrapper2(dat, c("D", "Fid"),
> function(x) c(sum(x[,"A"], na.rm=T), sum(x[,"A"], na.rm=T)));
> proc.time()-temp
> user system elapsed
> 4.616 0.006 4.630
> #note in thie case the anonymous function I pass in isn't used because I
> hardcode the function into the lapply.
>
> approx 4 seconds
>
> This runs very fast. This runs very slow:
>
> fedb.ddplyWrapper2 <- function(data, pivotColumns, listNameFunctions, ...){
> lapplyFunctionRecurse <- function(cdata, level=1, ...){
> if(level==1){
>
> return(lapply(split(seq(nrow(cdata)),cdata[,pivotColumns[level]], drop=T),
> function(x) lapplyFunctionRecurse(x, level+1, ...)))
> } else if (level==length(pivotColumns)) {
> #this line is different. it essentially calls the function you
> pass in
> return(lapply(split(cdata,data[cdata,pivotColumns[level]],
> drop=T), function(x, ...) listNameFunctions(data[x,], ...)))
> } else {
> return(lapply(split(cdata,data[cdata,pivotColumns[level]],
> drop=T), function(x) lapplyFunctionRecurse(x, level+1, ...)))
> }
> }
> result = lapplyFunctionRecurse(data, ...)
> matrix2 <- do.call('rbind', lapply(result, function(x)
> do.call('rbind',x)))
> return(matrix2)
> }
>
> dat <- data.frame(D=sample(32000:33000, 666000,
> T),Fid=sample(1:10,666000,T), A=sample(1:5,666000,T))
>> temp = proc.time(); ret = fedb.ddplyWrapper2(dat, c("D", "Fid"),
> function(x) c(sum(x[,"A"], na.rm=T), sum(x[,"A"], na.rm=T)));
> proc.time()-temp
> user system elapsed
> 16.346 65.059 81.680
>
> head(ret3)
D Fid V1
1 32000 1 175
2 32000 2 222
3 32000 3 221
4 32000 4 134
5 32000 5 253
6 32000 6 194
>
>
> Can anyone explain to me why there is a 4x time difference? I don't want to
> have to hardcore into the recursion function, but if I have to I will.
>
> Thanks,
> Rob
>
> [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
More information about the R-help
mailing list