[R] Significant performance difference between split of a data.frame and split of vectors
Peng Yu
pengyu.ut at gmail.com
Wed Dec 9 05:28:06 CET 2009
I have the following code, which tests split() on a data.frame and
split() on each column (as a vector) separately. The runtimes differ
by a factor of about 10. When m and k increase, the difference becomes
even bigger.
I'm wondering why the performance on data.frame is so bad. Is it a bug
in R? Can it be improved?
> system.time(split(as.data.frame(x),f))
   user  system elapsed
  1.700   0.010   1.786
>
> system.time(lapply(
+ 1:dim(x)[[2]]
+ , function(i) {
+ split(x[,i],f)
+ }
+ )
+ )
   user  system elapsed
  0.170   0.000   0.167
###########
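# Problem size: m rows, n columns, k groups
m=30000
n=6
k=3000
set.seed(0)
x=replicate(n,rnorm(m))               # m-by-n numeric matrix
f=sample(1:k, size=m, replace=TRUE)   # group label for each row
# Split the whole data.frame at once
system.time(split(as.data.frame(x),f))
# Split each column (as a vector) separately
system.time(lapply(
  seq_len(ncol(x))
  , function(i) {
    split(x[,i],f)
  }
)
)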
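
One workaround that might be worth trying is to split the row indices once and then subset the matrix per group, instead of calling split() on the data.frame. The following is only a sketch: it assumes the cost lies in subsetting the data.frame rather than in the split itself, which has not been measured here. The names idx, by_index, by_df and same are introduced for illustration, and the sizes are smaller than the ones above to keep it quick.

```r
# Sketch (assumption: per-group data.frame subsetting is the slow part).
set.seed(0)
m = 3000; n = 6; k = 300
x = replicate(n, rnorm(m))
f = sample(1:k, size = m, replace = TRUE)

# One split over the row indices, which is a plain integer vector.
idx = split(seq_len(nrow(x)), f)

# Subset the matrix per group; drop = FALSE keeps one-row groups as matrices.
by_index = lapply(idx, function(i) x[i, , drop = FALSE])

# Result of the data.frame method, kept only for comparison.
by_df = split(as.data.frame(x), f)

# Check that both hold the same values, group by group.
same = all(mapply(function(a, b) all(a == as.matrix(b)), by_index, by_df))
```

Each element of by_index is a numeric matrix rather than a data.frame, so this only helps when the columns share one type, as they do in the test above.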