[R] R tools for large files
Murray Jorgensen
maj at stats.waikato.ac.nz
Tue Aug 26 01:45:28 CEST 2003
I would like to thank those who have responded and especially Brian
Ripley for making his unix tools for Windows available. A colleague has
also mentioned to me the set of unix tools called Cygwin.
Two things that can be done with R alone are to read the first n lines
of a file into n strings with readLines() and to scan in a block of the
file after skipping a number of lines.
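Something along these lines seems to work; the file name and the line
counts here are only placeholders:

  first20 <- readLines("bigfile.dat", n = 20)   # first 20 lines as strings

  # a block of 1000 rows starting after line 50000, read straight into a
  # data frame (assuming no header line)
  block <- read.table("bigfile.dat", skip = 50000, nrows = 1000,
                      header = FALSE)

  # or the same block with scan(), one string per line
  block.lines <- scan("bigfile.dat", what = "", sep = "\n",
                      skip = 50000, nlines = 1000, quiet = TRUE)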
I will probably use Fortran to extract subsets of the file, as I need to
use it for other things that I am planning to do with the file.
I'll maybe also play a bit with readLines() and writeLines() inside
loops to see if I can build up my random subsets of the file this way.
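Something like the following loop is the sort of thing I have in mind;
the file names, chunk size and number of sampled lines are only
placeholders, and I am assuming the file has no header line:

  wanted <- sort(sample(100000, 5000))   # line numbers to keep

  infile  <- file("bigfile.dat", "r")
  outfile <- file("subset.dat", "w")
  done <- 0
  repeat {
      chunk <- readLines(infile, n = 10000)    # next 10000 lines
      if (length(chunk) == 0) break
      keep <- wanted[wanted > done & wanted <= done + length(chunk)] - done
      if (length(keep) > 0) writeLines(chunk[keep], outfile)
      done <- done + length(chunk)
  }
  close(infile)
  close(outfile)

The extracted lines can then be read back with read.table("subset.dat")
in the usual way.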
BTW, I now estimate the file at about 100,000 lines, so indeed it is not
all that large!
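For counting the lines of a file from R alone (no wc needed), a
connection and a loop of the same kind seems to do; the chunk size is
arbitrary:

  count.lines <- function(filename, chunk = 10000) {
      con <- file(filename, "r")
      n <- 0
      repeat {
          piece <- readLines(con, n = chunk)
          if (length(piece) == 0) break
          n <- n + length(piece)
      }
      close(con)
      n
  }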
Murray Jorgensen
Prof Brian Ripley wrote:
> On Mon, 25 Aug 2003, Murray Jorgensen wrote:
>
>
>>At 08:12 25/08/2003 +0100, Prof Brian Ripley wrote:
>>
>>>I think that is only a medium-sized file.
>>
>>"Large" for my purposes means "more than I really want to read into memory"
>>which in turn means "takes more than 30s". I'm at home now and the file
>>isn't, so I'm not sure if the file is large or not.
>>
>>More responses interspersed below. BTW, I forgot to mention that I'm using
>>Windows and so do not have nice unix tools readily available.
>
>
> But you do, thanks to me, as you need them to install R packages.
>
>
>>>On Mon, 25 Aug 2003, Murray Jorgensen wrote:
>>>
>>>
>>>>I'm wondering if anyone has written some functions or code for handling
>>>>very large files in R. I am working with a data file that is 41
>>>>variables times who knows how many observations making up 27MB altogether.
>>>>
>>>>The sort of thing that I am thinking of having R do is
>>>>
>>>>- count the number of lines in a file
>>>
>>>You can do that without reading the file into memory: use
>>>system(paste("wc -l", filename))
>>
>>Don't think that I can do that in Windows XL.
>
>
> I presume you mean Windows XP? Of course you can, and wc.exe is in
> Rtools.zip!
>
>
>>>or read in blocks of lines via a
>>>connection
>>
>>But that does sound promising!
>>
>>
>>>>- form a data frame by selecting all cases whose line numbers are in a
>>>>supplied vector (which could be used to extract random subfiles of
>>>>particular sizes)
>>>
>>>R should handle that easily in today's memory sizes. Buy some more RAM if
>>>you don't already have 1/2Gb. As others have said, for a really large
>>>file, use an RDBMS to do the selection for you.
>>
>>It's just that R is so good at reading in initial segments of a file that I
>>can't believe that it can't be effective at reading more general
>>(pre-specified) subsets.
>
>
--
Dr Murray Jorgensen http://www.stats.waikato.ac.nz/Staff/maj.html
Department of Statistics, University of Waikato, Hamilton, New Zealand
Email: maj at waikato.ac.nz Fax 7 838 4155
Phone +64 7 838 4773 wk +64 7 849 6486 home Mobile 021 1395 862