[R] How to import BIG csv files with separate "map"?
Steve Lianoglou
mailinglist.honeypot at gmail.com
Tue Jul 14 21:50:27 CEST 2009
Hi,
On Jul 14, 2009, at 1:53 PM, giusto wrote:
>
> Hi all,
>
> I am having problems importing a VERY large dataset in R. I have looked into
> the package ff, and that seems to suit me, but also, from all the examples I
> have seen, it either requires a manual creation of the database, or it needs
> a read.table kind of step. Being a survey kind of data the file is big (like
> 20,000 times 50,000 for a total of about 1.2Gb in plain text) the memory I
> have isn't enough to do a read.table and my computer freezes every time :(
Look at the documentation near the end of ?read.table:
"""Note that unless colClasses is specified, all columns are read as
character columns and then converted. This means that quotes are
interpreted in all fields and that a column of values like "42" will
result in an integer column."""
So all the data is first read in as character strings, and then R tries
to convert each column to the appropriate data type, which uses a lot
of memory. Specifying the type of each column via the colClasses
argument may get you where you need to be.
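As a rough sketch of the colClasses idea (not tested on a file of your size): read only a few rows to learn the column types, then pass those types to the full read. The file and the column names "id", "score" and "group" below are made up for illustration. Types guessed from the first few rows can be wrong for later rows, so check them against your codebook. Setting a column's class to "NULL" skips that column entirely, which may help since you only need a subset.

```r
# Hypothetical small file standing in for the real survey CSV.
path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:5,
                     score = c(1.5, 2, 3.25, 4, 5.5),
                     group = c("a", "b", "a", "c", "b")),
          path, row.names = FALSE)

# Peek at a handful of rows to learn the column types cheaply.
peek <- read.csv(path, nrows = 3, stringsAsFactors = FALSE)
classes <- sapply(peek, class)

# Read the whole file with the types fixed up front; giving nrows
# (when the row count is known) is a further hint to read.table.
full <- read.csv(path, colClasses = classes, nrows = 5)

# Skip columns you do not need by setting their class to "NULL".
wanted <- classes
wanted["group"] <- "NULL"
subset_only <- read.csv(path, colClasses = wanted)
```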
> This far I have managed to import the required subset of the data by using a
> "cheat": I used GRETL to read an equivalent Stata file (released by the same
> source that offered the csv file), manipulate it and export it in a format
> that R can read into memory.
I'm not sure if you're suggesting that R can read in the whole data
file when it is stored in some binary format such as Stata's. If so,
perhaps the colClasses trick above might work.
If read.table with colClasses doesn't work (and you know you can
load the entire dataset into R via some binary format), perhaps you'll
have to parse the file line by line by opening it with a "file(..,
'r')" command, and using "scan" (or readChar?) to run through the file
without having to load it all into memory at once.
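A minimal sketch of that chunked approach, assuming a comma-separated file with one header line. The toy file, the column names, the chunk size of 2 lines and the filter "score > 2" are all placeholders; for the real file you would use a much larger chunk size and your own selection rule.

```r
# Hypothetical small file standing in for the real survey CSV.
path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:5,
                     score = c(1.5, 2, 3.25, 4, 5.5),
                     group = c("a", "b", "a", "c", "b")),
          path, row.names = FALSE)

con <- file(path, "r")              # open once; reads continue where the last one stopped
header <- readLines(con, n = 1)     # consume the header line

# Template telling scan() the type of each field in a record.
what <- list(id = integer(), score = numeric(), group = character())

kept <- list()
repeat {
  # Read at most 2 lines per pass (use thousands for a real file).
  chunk <- scan(con, what = what, sep = ",", nlines = 2, quiet = TRUE)
  if (length(chunk$id) == 0) break  # nothing left to read
  chunk <- as.data.frame(chunk, stringsAsFactors = FALSE)
  # Keep only the rows of interest so memory use stays small.
  kept[[length(kept) + 1]] <- chunk[chunk$score > 2, ]
}
close(con)

result <- do.call(rbind, kept)
```

Only one chunk plus the rows kept so far are in memory at any time, so the peak footprint depends on the chunk size and on how selective the filter is, not on the size of the file.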
HTH,
-steve
--
Steve Lianoglou
Graduate Student: Physiology, Biophysics and Systems Biology
Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact