[R] Optimise huge data.frame construction

Moshe Olshansky m_olshansky at yahoo.com
Wed Feb 24 11:09:02 CET 2010


Hi Daniele,

One possibility is to make two passes over the IDs. In the first pass you do not build the matrix; you only compute (in a loop) the number of rows you will need. Then you allocate the matrix once, at its final size, and fill it in the second pass.
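A minimal sketch of the two-pass idea, in R. Here `count_rows()` and `compute_rows()` are hypothetical stand-ins for the per-ID database query and calculation (they are not from the original post), and the 100 IDs and dummy values are made up for illustration only.

```r
vars <- c('A', 'B', 'C', 'D')   # the known set of variables
ids  <- 1:100                   # made-up IDs for the example

# Hypothetical: number of observations for one ID (15 to 18 here).
count_rows <- function(id) 14L + as.integer(id %% 5L)

# Hypothetical: computed values for one ID, as a numeric matrix whose
# columns are 'timestamp', 'val' and whichever known variables are present.
compute_rows <- function(id) {
  n <- count_rows(id)
  present <- vars[seq_len(1L + id %% length(vars))]
  base  <- cbind(timestamp = seq_len(n), val = id + seq_len(n) / 100)
  extra <- matrix(1, n, length(present), dimnames = list(NULL, present))
  cbind(base, extra)
}

# Pass 1: count rows only, nothing is stored yet.
n_rows <- vapply(ids, count_rows, integer(1))
end    <- cumsum(n_rows)
start  <- end - n_rows + 1L

# Allocate once at the final size. Filling with 0 (an assumption of this
# sketch) means variables absent for an ID need no cleanup afterwards.
out <- matrix(0, sum(n_rows), 3L + length(vars),
              dimnames = list(NULL, c('ID', 'timestamp', 'val', vars)))

# Pass 2: write each ID's block of rows into its precomputed position.
for (i in seq_along(ids)) {
  rows  <- seq.int(start[i], length.out = n_rows[i])
  block <- compute_rows(ids[i])
  out[rows, 'ID'] <- ids[i]
  out[rows, colnames(block)] <- block
}
```

Whether this pays off depends on how cheap the counting pass is compared with the full calculation; if counting requires the full calculation anyway, keeping the per-ID results in a list and binding them once may be simpler.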

Regards,
Moshe.

--- On Wed, 24/2/10, Daniele Amberti <daniele.amberti at ors.it> wrote:

> From: Daniele Amberti <daniele.amberti at ors.it>
> Subject: [R] Optimise huge data.frame construction
> To: "r-help at r-project.org" <r-help at r-project.org>
> Received: Wednesday, 24 February, 2010, 8:34 PM
> I have data for different items (ID)
> in a database.
> For each ID I have to get:
> 
> - timestamp of the observation (timestamp);
> - numerical value (val) that will be my response variable in some kind of model;
> - a variable number of variables from a known set (if the value for a specific
>   variable is not present in the DB it is 0).
> 
> To get the above-mentioned values I have to cycle over
> IDs, do some calculations and store the results to construct a
> huge data.frame for subsequent estimations. The number of
> rows for each ID is random (typically 14 to 200).
> 
> My current approach is to construct a matrix like this:
> 
> out <- c('A', 'B', 'C', 'D')
> out <- matrix(-1, 5000, 3 + length(out),
>               dimnames = list(1:5000, c('ID', 'timestamp', 'val', out)))
> 
> I access the out matrix by numerical index to substitute
> values ( out[1:n,1] <- k ).
> When the matrix is full I add 5000 rows and go on.
> Afterwards I remove the rows with ID still set to -1 and then replace all other
> -1 values with 0.
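The cleanup step described above could look roughly like the following sketch. The small 10-row matrix and the values written into it are made up for illustration, and `vars` is a name introduced here for the known variable set.

```r
vars <- c('A', 'B', 'C', 'D')
out <- matrix(-1, 10, 3 + length(vars),
              dimnames = list(NULL, c('ID', 'timestamp', 'val', vars)))

# Fill a few rows the way the loop would (illustrative values only).
out[1:3, 'ID']        <- 7
out[1:3, 'timestamp'] <- 1:3
out[1:3, 'val']       <- c(0.5, 1.5, 2.5)
out[1:3, 'B']         <- 2

# Drop the rows that were never used (ID still at the -1 placeholder).
out <- out[out[, 'ID'] != -1, , drop = FALSE]

# Replace the remaining -1 placeholders with 0, but only in the variable
# columns, so that a genuine val of -1 would not be overwritten.
out[, vars][out[, vars] == -1] <- 0
```

Both steps are vectorised over the whole matrix, so they run once at the end rather than inside the loop over IDs.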
> 
> In my application an ID typically has between
> 14 and 200 observations (mean around 50), but I have 15000
> IDs ...
> After profiling I realized that accessing the out matrix
> this way is too slow.
> 
> Do you have any idea how to speed up this kind of
> process?
> I think something can be done by creating a data.frame for
> each ID and binding them at the end. Is it a good idea? How can
> I implement that? A list of data.frames? And then?
> 
> Below some code that can be useful if someone would like to
> experiment ...
> 
> alist <- vector('list', 3)
> alist[[1]] <- data.frame( ID = 1, timestamp = 1:14, val = rnorm(14), A = 1, B = 2, C = 3 )
> alist[[2]] <- data.frame( ID = 2, timestamp = 2:15, val = rnorm(14), B = 2, C = 3, D = 4 )
> alist[[3]] <- data.frame( ID = 3, timestamp = 3:30, val = rnorm(28), C = 1, D = 2 )
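The list-of-data.frames idea asked about above could be sketched as follows, using the same sample data. The helper `pad()` and the names `vars` and `big` are introduced here for illustration; this is one possible way to do it, not a tested recommendation for 15000 IDs.

```r
vars <- c('A', 'B', 'C', 'D')   # the known set of variables

alist <- vector('list', 3)
alist[[1]] <- data.frame(ID = 1, timestamp = 1:14, val = rnorm(14), A = 1, B = 2, C = 3)
alist[[2]] <- data.frame(ID = 2, timestamp = 2:15, val = rnorm(14), B = 2, C = 3, D = 4)
alist[[3]] <- data.frame(ID = 3, timestamp = 3:30, val = rnorm(28), C = 1, D = 2)

# Add every missing known variable as a column of zeros and put the
# columns in a common order, so that rbind() can stack the pieces.
pad <- function(d) {
  for (v in setdiff(vars, names(d))) d[[v]] <- 0
  d[, c('ID', 'timestamp', 'val', vars)]
}

# Bind once at the end instead of growing the result inside the loop.
big <- do.call(rbind, lapply(alist, pad))
```

The point of the design is that each per-ID piece is built independently and the expensive combination happens a single time, so no large object is repeatedly copied while looping.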
> 
> 
> Thanks in advance for your valuable help.
> Daniele
> 
> ________________________________
> ORS Srl
> 
> Via Agostino Morando 1/3 12060 Roddi (Cn) - Italy
> Tel. +39 0173 620211
> Fax. +39 0173 620299 / +39 0173 433111
> Web Site www.ors.it
> 


