[R] Help with R
Christoph Lehmann
christoph.lehmann at gmx.ch
Thu May 5 15:05:55 CEST 2005
>> I heard that 'R' does not do a very good job at handling large
>> datasets, is this true?
Importing a huge dataset into a data.frame and then, e.g., converting
some columns into factors can run into memory trouble (probably due to
the memory overhead of building the factors). But we recently succeeded
in importing 12 million data records stored in a MySQL database, using
the RMySQL package. The procedure that led to success was (see the R
sketch after the list):
0 define a data.frame 'data.total' with the size needed to hold the
  whole dataset to be imported
in a loop, do:
1 import the data in chunks of e.g. 30000 records per chunk and save
  each chunk in a temporary data.frame 'data.chunk'
2 do the conversion into factors and other preprocessing steps, such
  as data aggregation, on each single chunk in 'data.chunk' right
  after import
3 copy the preprocessed chunk into the appropriate rows of the
  data.frame 'data.total' defined at the start
4 once the loop finishes, the whole dataset is imported and
  'data.total' is ready for further computational steps
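Here is a minimal sketch of that loop. The connection details, the
table name 'big_table', its columns 'id', 'group_col', 'value', and
the chunk size are all placeholders -- adapt them to your setup:

library(RMySQL)

## placeholder connection details -- adapt to your database
con <- dbConnect(MySQL(), dbname = "mydb",
                 user = "user", password = "pass")

chunk.size <- 30000
n.total <- dbGetQuery(con, "SELECT COUNT(*) FROM big_table")[1, 1]

## fetch the factor levels once, so every chunk is converted
## consistently and fits the preallocated factor column
group.levels <- dbGetQuery(con,
    "SELECT DISTINCT group_col FROM big_table")[, 1]

## step 0: preallocate 'data.total' at its final size and types
data.total <- data.frame(
    id    = integer(n.total),
    group = factor(rep(NA, n.total), levels = group.levels),
    value = numeric(n.total))

offset <- 0
while (offset < n.total) {
    ## step 1: import one chunk into a temporary data.frame
    sql <- sprintf(
        "SELECT id, group_col, value FROM big_table LIMIT %d OFFSET %d",
        chunk.size, offset)
    data.chunk <- dbGetQuery(con, sql)
    if (nrow(data.chunk) == 0) break

    ## step 2: preprocess the chunk on its own, e.g. factor conversion
    data.chunk$group_col <- factor(data.chunk$group_col,
                                   levels = group.levels)

    ## step 3: copy the chunk into its rows of 'data.total'
    rows <- (offset + 1):(offset + nrow(data.chunk))
    data.total$id[rows]    <- data.chunk$id
    data.total$group[rows] <- data.chunk$group_col
    data.total$value[rows] <- data.chunk$value

    offset <- offset + nrow(data.chunk)
}

dbDisconnect(con)
## step 4: 'data.total' now holds the full dataset

(On a very large table, paging with an indexed key, e.g. WHERE id
BETWEEN ..., is usually faster than plain LIMIT/OFFSET; the LIMIT form
just keeps the sketch short.)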
In a nutshell: preprocessing steps such as conversion into factors can
cause memory trouble, even for datasets which per se don't take too
much memory. But done separately on smaller chunks of data, the import
can be handled by R very efficiently. The 'team' of MySQL together
with R is VERY powerful.
Cheers
Christoph