[R] [FORGED] Splitting data.frame into a list of small data.frames given indices
Rolf Turner
r.turner at auckland.ac.nz
Wed Jun 29 12:00:58 CEST 2016
On 29/06/16 21:16, Witold E Wolski wrote:
> It's the inverse problem to merging a list of data.frames into a large
> data.frame just discussed in the "performance of do.call("rbind")"
> thread
>
> I would like to split a data.frame into a list of data.frames
> according to first column.
> This seems to be easily possible with the function base::by. However,
> as soon as the data.frame has a few million rows this function cannot
> be used (unless you have plenty of time).
>
> For 'by', runtime grows as ~ nrow^2, i.e. O(n^2) (see benchmark below).
>
> So basically I am looking for a similar function with better complexity.
>
>
> > nrows <- c(1e5,1e6,2e6,3e6,5e6)
> > timing <- list()
> > for(i in nrows){
> + dum <- peaks[1:i,]
> + timing[[length(timing)+1]] <- system.time(x <- by(dum[,2:3],
> +   INDICES=list(dum[,1]), FUN=function(x){x}, simplify = FALSE))
> + }
> > names(timing) <- nrows
> > timing
> $`1e+05`
> user system elapsed
> 0.05 0.00 0.05
>
> $`1e+06`
> user system elapsed
> 1.48 2.98 4.46
>
> $`2e+06`
> user system elapsed
> 7.25 11.39 18.65
>
> $`3e+06`
> user system elapsed
> 16.15 25.81 41.99
>
> $`5e+06`
> user system elapsed
> 43.22 74.72 118.09
I'm not sure that I follow what you're doing, and your example is not
reproducible, since we have no idea what "peaks" is, but on a toy
example with 5e6 rows in the data frame I got a timing result of
user system elapsed
0.379 0.025 0.406
when I applied split(). Is this adequately fast? Seems to me that if
you want to split something, split() would be a good place to start.
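
For illustration only, here is a minimal sketch of splitting with split(). It is not the toy example timed above: the data frame "toy", its column names, and its size are all assumptions, since "peaks" was not supplied. It splits columns 2 and 3 by the values in column 1, mirroring the by() call in the quoted benchmark.

```r
# Hypothetical stand-in for the unseen "peaks" data frame;
# the column names and the number of groups are assumptions.
set.seed(1)
n <- 1e5
toy <- data.frame(id        = sample(1000L, n, replace = TRUE),
                  mz        = runif(n),
                  intensity = rexp(n))

# One data frame per distinct value of the first column.
pieces <- split(toy[, 2:3], toy[, 1])

# Timing of the same call, for comparison with the by() benchmark.
elapsed <- system.time(split(toy[, 2:3], toy[, 1]))

# Sanity checks: nothing lost, one element per group.
total_rows <- sum(vapply(pieces, nrow, integer(1)))
n_groups   <- length(pieces)
```

The result is a named list; pieces[["42"]] would retrieve the rows whose id is 42, assuming that id occurs in the data.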
cheers,
Rolf Turner
--
Technical Editor ANZJS
Department of Statistics
University of Auckland
Phone: +64-9-373-7599 ext. 88276