[R] Large data sets with R (binding to hadoop available?)
Thomas Lumley
tlumley at u.washington.edu
Fri Aug 22 18:10:55 CEST 2008
On Thu, 21 Aug 2008, Roland Rau wrote:
> Hi
>
> Avram Aelony wrote: (in part)
>>
>> 1. How do others handle large data sets (gigabytes, terabytes) for
>> analysis in R?
>>
> I usually try to store the data in an SQLite database and interface with it
> via functions from the RSQLite (and DBI) packages.
>
> No idea about Question No. 2, though.
>
> Hope this helps,
> Roland
>
>
> P.S. When I am sure that I only need a certain subset of large data sets, I
> still prefer to do some pre-processing in awk (gawk).
> 2.P.S. My data sets are in the gigabyte range (not the terabyte range). This
> might be important if your data sets are *really large* and you want to use
> SQLite: http://www.sqlite.org/whentouse.html
>
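For the RSQLite approach Roland describes, the usual pattern looks roughly
like this; it is only a sketch, and the file, table, and column names
("big_data.csv", "measurements", id/year/value) are placeholders:

    library(DBI)
    library(RSQLite)

    ## open (or create) an on-disk SQLite database
    con <- dbConnect(SQLite(), dbname = "bigdata.sqlite")

    ## one-time import: RSQLite can load a delimited text file into a table
    ## (if your version lacks the file-import method, read the file in chunks
    ##  with read.csv() and dbWriteTable(..., append = TRUE) instead)
    dbWriteTable(con, "measurements", "big_data.csv",
                 sep = ",", header = TRUE, overwrite = TRUE)

    ## analysis sessions: pull only the subset you actually need into memory
    sub <- dbGetQuery(con,
        "SELECT id, year, value FROM measurements WHERE year >= 2000")

    dbDisconnect(con)

The whole file stays on disk in the database; only the result of each query
is ever held in R's memory.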
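The awk pre-filtering mentioned in the P.S. can also be done without an
intermediate file by reading from a pipe; the file name, field number, and
threshold here are again just placeholders:

    ## keep the header line plus rows whose 3rd comma-separated field exceeds 100
    cmd <- "gawk -F',' 'NR == 1 || $3 > 100' big_data.csv"
    subset_df <- read.csv(pipe(cmd))
    str(subset_df)

read.csv() reads directly from the pipe connection, so only the filtered
rows ever reach R.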
I use netCDF, with the ncdf package, for (genomic) datasets in the 100GB
range, because SQLite was too slow for the sort of queries I needed.
HDF5 would be another possibility; I'm not sure of the current status of
the HDF5 support in Bioconductor, though.
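For reference, pulling one slab out of a large netCDF file with the ncdf
package looks roughly like this; the file and variable names are invented
for illustration:

    library(ncdf)

    ## open an existing netCDF file; this reads metadata, not the data itself
    nc <- open.ncdf("genotypes.nc")

    ## read one slab of a (snp x sample) array: all SNPs for samples 1..100
    ## (count = -1 means "the whole dimension")
    geno <- get.var.ncdf(nc, "genotype",
                         start = c(1, 1),
                         count = c(-1, 100))

    close.ncdf(nc)

Only the requested slab is read from disk, which is what makes random access
into files of that size workable.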
-thomas
Thomas Lumley, Assoc. Professor, Biostatistics
University of Washington, Seattle
tlumley at u.washington.edu