[R] the large dataset problem
jim holtman
jholtman at gmail.com
Tue Jul 31 03:46:45 CEST 2007
FYI. I used your script on a Windows machine with a 1.5 GHz processor, using
the Cygwin software, which provides the Unix utilities. The file has 1000
lines with 10,000 fields on each line. Here is what it reported:
gawk 'BEGIN{FS=","}{print $(1) "," $(1000) "," $(1275) "," $(5678)}'
< tempxx.txt > newdata.csv
real 0m0.806s
user 0m0.640s
sys 0m0.124s
It took less than a second to process the file, so it should still
be pretty fast on Windows. BTW, the first run took 30 seconds of real
time because of the slow disk that I have; the run above had the data
already cached in memory.
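
For anyone who wants to repeat the test, here is a rough sketch that builds a file of the same shape (1000 lines, 10,000 comma-separated fields) and runs the same extraction on it. The field contents are an assumption, since the contents of the original file are not given: each field simply holds its own column number, which makes the output easy to check. Plain awk is used in place of gawk; either should do.

```shell
# Build one line holding the column numbers 1..10000, separated by commas,
# then write that line 1000 times to get a file shaped like the one above.
awk 'BEGIN {
    line = "1"
    for (j = 2; j <= 10000; j++) line = line "," j
    for (i = 1; i <= 1000; i++) print line
}' > tempxx.txt

# Same extraction as above. To measure it, prefix this command with the
# shell's "time", and run it twice so the second run reads from the cache.
awk 'BEGIN{FS=","}{print $(1) "," $(1000) "," $(1275) "," $(5678)}' < tempxx.txt > newdata.csv

# Every output line should name the four selected columns.
head -n 1 newdata.csv
```

Timings will of course depend on the machine and on whether the file is already cached.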
On 7/30/07, Ted Harding <ted.harding at nessie.mcc.ac.uk> wrote:
> On 30-Jul-07 11:40:47, Eric Doviak wrote:
> > [...]
>
> Sympathies for the constraints you are operating in!
>
> > The "Introduction to R" manual suggests modifying input files with
> > Perl. Any tips on how to get started? Would Perl Data Language (PDL) be
> > a good choice? http://pdl.perl.org/index_en.html
>
> I've not used SIPP files, but it seems that they are available in
> "delimited" format, including CSV.
>
> For extracting a subset of fields (especially when large datasets may
> stretch RAM resources) I would use awk rather than perl, since it
> is a much lighter program, transparent to code for, efficient, and
> it will do that job.
>
> On a Linux/Unix system (see below), say I wanted to extract fields
> 1, 1000, 1275, .... , 5678 from a CSV file. Then the 'awk' line
> that would do it would look like
>
> awk '
> BEGIN{FS=","}{print $(1) "," $(1000) "," $(1275) "," ... $(5678)}
> ' < sippfile.csv > newdata.csv
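
A minimal runnable version of the same idea, on a tiny made-up file so the result can be checked by eye. The names sample.csv and subset.csv and the field values are invented for illustration. Setting OFS as well as FS is a small variation on the line above: it lets print join its arguments with commas instead of spelling out each ",".

```shell
# Tiny stand-in for the real file: 2 lines, 6 comma-separated fields each.
printf 'a1,a2,a3,a4,a5,a6\nb1,b2,b3,b4,b5,b6\n' > sample.csv

# Keep fields 1, 3 and 6. FS splits each input line on commas; OFS is the
# separator print puts between its comma-separated arguments.
awk 'BEGIN { FS = ","; OFS = "," } { print $1, $3, $6 }' < sample.csv > subset.csv

cat subset.csv
```

This should print a1,a3,a6 and b1,b3,b6 on two lines. Note that a plain comma split like this does not handle quoted CSV fields that themselves contain commas.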
>
> Awk reads one line at a time, and does with it what you tell it to do.
> It will not be overcome by a file with an enormous number of lines.
> Perl would be similar. So long as one line fits comfortably into RAM,
> you would not be limited by file size (unless you're running out
> of disk space), and operation will be quick, even for very long
> lines (as an experiment, I just set up a file with 10,000 fields
> and 35 lines; awk output 6 selected fields from all 35 lines in
> about 1 second, on the 366MHz 128MB RAM machine I'm on at the
> moment. After transferring it to a 733MHz 512MB RAM machine, it was
> too quick to estimate; so I duplicated the lines to get a 363-line
> file, and now got those same fields out in a bit less than 1 second.
> So that's over 300 lines/second, about 18,000 lines a minute, a million
> lines in under an hour; and all on rather puny hardware).
>
> In practice, you might want to write a separate script which would
> automatically create the necessary awk script (say if you supply
> the field names, having already coded the field positions corresponding
> to field names). You could exploit R's system() command to run the
> scripts from within R, and then load in the filtered data.
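
One way such a generator might look, as a sketch only. Everything specific here is an assumption made for illustration: the lookup file fieldpos.txt, the field names and positions in it, the sample data, and the output file extract.awk. Real SIPP field names and positions would have to come from the documentation for the file in question.

```shell
# Hypothetical lookup table: one "name position" pair per line.
cat > fieldpos.txt <<'EOF'
id 1
age 3
income 5
EOF

# Tiny stand-in for the data file (5 comma-separated fields per line).
printf '101,x,34,y,52000\n102,x,41,y,61000\n' > sippfile.csv

# Names of the fields wanted on this run.
wanted="id income"

# Translate the names into an awk print list such as: $(1) "," $(5)
fields=""
for name in $wanted; do
    pos=$(awk -v n="$name" '$1 == n { print $2 }' fieldpos.txt)
    [ -n "$pos" ] || { echo "unknown field: $name" >&2; exit 1; }
    if [ -z "$fields" ]; then
        fields="\$($pos)"
    else
        fields="$fields \",\" \$($pos)"
    fi
done

# Write the generated awk script to a file, then run it on the data.
printf 'BEGIN { FS = "," }\n{ print %s }\n' "$fields" > extract.awk
awk -f extract.awk < sippfile.csv > newdata.csv

cat extract.awk
cat newdata.csv
```

If this were saved as a script (say make_extract.sh, again a made-up name), it could be run from R with something like system("sh make_extract.sh"), and the result loaded with read.csv("newdata.csv", header = FALSE). Adding a forgotten variable later then only means editing the list of wanted names and rerunning.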
>
> > I wrote a script which loads large datasets a few lines at a time,
> > writes the dozen or so variables of interest to a CSV file, removes
> > the loaded data and then (via a "for" loop) loads the next few lines
> > .... I managed to get it to work with one of the SIPP core files,
> > but it's SLOOOOW.
>
> See above ...
>
> > Worse, if I discover later that I omitted a relevant variable,
> > then I'll have to run the whole script all over again.
>
> If the script worked quickly (as with awk), presumably you
> wouldn't mind so much?
>
> Regarding Linux/Unix versus Windows. It is general experience
> that Linux/Unix works faster, more cleanly and efficiently, and
> often more reliably, for similar tasks; and can do so on low grade
> hardware. Also, these systems come with dozens of file-processing
> utilities (including perl and awk; also many others), each of which
> has been written to be efficient at precisely the repertoire of
> tasks it was designed for. A lot of Windows software carries a huge
> overhead of either cosmetic dross, or a pantechnicon of functionality
> of which you are only going to need 0.01% at any one time.
>
> The Unix utilities have been ported to Windows, long since, but
> I have no experience of using them in that environment. Others,
> who have, can advise! But I'd seriously suggest getting hold of them.
>
> Hoping this helps,
> Ted.
>
> --------------------------------------------------------------------
> E-Mail: (Ted Harding) <ted.harding at nessie.mcc.ac.uk>
> Fax-to-email: +44 (0)870 094 0861
> Date: 30-Jul-07 Time: 18:24:41
> ------------------------------ XFMail ------------------------------
--
Jim Holtman
Cincinnati, OH
+1 513 646 9390
What is the problem you are trying to solve?