[R] Why are big data.frames slow? What can I do to get it faster?

Peter Dalgaard BSA p.dalgaard at biostat.ku.dk
Mon Oct 7 14:02:01 CEST 2002

"Marcus Jellinghaus" <Marcus_Jellinghaus at gmx.de> writes:

> First I want to say "thank you" to everybody who replied.
> I understand that vectorized operations instead of the loop are faster.
> I also made sure not to use factors.
> Since the loop runs 100 times in my example, the loop should only take the
> time of the vectorized operation multiplied by 100.
> But the loop takes about 10 minutes, the vectorized operation takes about 3
> seconds. (See below)
> Why is that? Shouldn't the loop take at most 100 * 3 seconds = 5 minutes?

R will likely have to invoke the garbage collector a couple of times
during the loop, and there might also be issues of memory growth
kicking in. Once you get beyond some threshold, the machine starts
swapping bits and pieces of the workspace in and out of physical memory.
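
One rough way to check whether memory growth is involved is to compare
gc() statistics before and after a short loop. This is only a sketch:
the data frame, the column names V1..V6 and the loop length of 20 are
illustrative assumptions, not taken from the original code, and the
numbers will differ between R versions and machines.

```r
# Illustrative data: N rows, six numeric columns (names V1..V6 are assumed)
N <- 100000
test <- as.data.frame(lapply(1:6, function(i) rnorm(N)))
names(test) <- paste("V", 1:6, sep = "")

# Collect garbage now and reset the "max used" statistics
before <- gc(reset = TRUE)

# A short version of the row-by-row loop
for (i in 1:20) test[i, 6] <- paste(test[i, 2], "-", test[i, 3], sep = "")

after <- gc()

# Peak number of vector cells in use during the loop, relative to the start
peak_growth <- after["Vcells", "max used"] - before["Vcells", "used"]
print(peak_growth)
```

Calling gcinfo(TRUE) before the loop should additionally print a
message each time a collection takes place.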

It's somewhat difficult to reproduce the behaviour, since you only give
part of the code necessary (e.g. how many *columns* do you have in
your data frame?) 

Something like this?

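N <- 100000
# Six numeric columns of N rows each
test <- as.data.frame(lapply(1:6, function(i) rnorm(N)))
# Vectorized: a single assignment covering rows 1 to 100
system.time(test[1:100, 6] <- paste(test[1:100, 2], "-", test[1:100, 3], sep = ""))
# Loop: 100 separate single-row assignments into the data frame
system.time(for (i in 1:100) test[i, 6] <- paste(test[i, 2], "-", test[i, 3], sep = ""))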

(Using N==500000 made my little desktop swap like crazy, but the above
gave something like 2s CPU time for the 1st case and 92s CPU + 23s
system for the other one with R 1.6.0)
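
If the loop cannot be avoided, one possible workaround is to pull the
columns out as plain vectors, loop over those, and write the result
back to the data frame once. The following is a minimal sketch under
that assumption; the names v2, v3, res and loop_time and the column
names V1..V6 are made up for illustration, and no timing is claimed
for it here.

```r
# Illustrative data: N rows, six numeric columns (names V1..V6 are assumed)
N <- 100000
test <- as.data.frame(lapply(1:6, function(i) rnorm(N)))
names(test) <- paste("V", 1:6, sep = "")

# Extract the columns as ordinary vectors
v2 <- test[[2]]
v3 <- test[[3]]
res <- as.character(test[[6]])   # column 6 will end up holding strings

# Loop over the vectors, so the data frame is not modified on each iteration
loop_time <- system.time(
  for (i in 1:100) res[i] <- paste(v2[i], "-", v3[i], sep = "")
)
print(loop_time)

# Write the finished column back in one step
test[[6]] <- res
```

The result should be the same as that of the vectorized assignment
above; only the place where the loop does its work has changed.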

   O__  ---- Peter Dalgaard             Blegdamsvej 3  
  c/ /'_ --- Dept. of Biostatistics     2200 Cph. N   
 (*) \(*) -- University of Copenhagen   Denmark      Ph: (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)             FAX: (+45) 35327907