[R] SLOW split() function
Dennis Murphy
djmuser at gmail.com
Tue Oct 11 05:36:14 CEST 2011
I tried this:
library(data.table)
N <- 1000
T <- N*10
d <- data.table(gp= rep(1:T, rep(N,T)), val=rnorm(N*T), key = 'gp')
dim(d)
[1] 10000000 2
# On my humble 8 GB system,
> system.time(l <- d[, split(val, gp)])
   user  system elapsed
   4.15    0.09    4.27
I wouldn't be surprised if there were a much faster way to do this
operation in data.table, since split() is a base R function rather than
a data.table-native one. This is about as fast as Jim Holtman's suggestion:
system.time(s <- split(seq_len(nrow(d)), d$gp))
   user  system elapsed
   4.15    0.09    4.29
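
To make the two approaches above concrete, here is a small-scale sketch in
base R. The sizes are shrunk so it runs instantly, and the names (df, G,
by_val, idx, grp3) are made up for illustration, not taken from the
original example. It only shows that splitting a single column, or
splitting row indices and subsetting on demand, gets at the same per-group
values without splitting the whole data frame; it makes no claim about
timings.

```r
# Illustrative toy data: G groups of N rows each (hypothetical sizes)
set.seed(1)
N  <- 5
G  <- 4
df <- data.frame(gp = rep(1:G, each = N), val = rnorm(N * G))

# Approach 1: split only the column of interest.
# Result is a list of G numeric vectors, named by group.
by_val <- split(df$val, df$gp)

# Approach 2: split the row indices rather than the data itself.
# Result is a list of G integer vectors of row numbers.
idx <- split(seq_len(nrow(df)), df$gp)

# Pull out the rows of one group only when they are needed
grp3 <- df[idx[["3"]], ]

# Both routes lead to the same values for group 3
same <- identical(by_val[["3"]], grp3$val)
```

The list of indices can be handed to lapply() or mclapply() in the same way
as the output of split(d, d$key), with each worker subsetting only its own
rows.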
HTH,
Dennis
On Mon, Oct 10, 2011 at 6:01 PM, ivo welch <ivo.welch at gmail.com> wrote:
> dear R experts: apologies for all my speed and memory questions. I
> have a bet with my coauthors that I can make R reasonably efficient
> through R-appropriate programming techniques. this is not just for
> kicks, but for work. for benchmarking, my [3 year old] Mac Pro has
> 2.8GHz Xeons, 16GB of RAM, and R 2.13.1.
>
> right now, it seems that 'split()' is why I am losing my bet. (split
> is an integral component of *apply() and by(), so I need split() to be
> fast. its resulting list can then be fed, e.g., to mclapply().) I
> made up an example to illustrate my ills:
>
> library(data.table)
> N <- 1000
> T <- N*10
> d <- data.table(data.frame( key= rep(1:T, rep(N,T)), val=rnorm(N*T) ))
> setkey(d, "key"); gc() ## force a garbage collection
> cat("N=", N, ". Size of d=", object.size(d)/1024/1024, "MB\n")
> print(system.time( s<-split(d, d$key) ))
>
> My ordered input data table (or data frame; doesn't make a difference)
> is 114MB in size. it takes about a second to create. split() only
> needs to reshape it. this simple operation takes almost 5 minutes on
> my computer.
>
> with a data set that is larger, this explodes further.
>
> am I doing something wrong? is there an alternative to split()?
>
> sincerely,
>
> /iaw
>
> ----
> Ivo Welch (ivo.welch at gmail.com)
>