[R] Memory/data -last time I promise

Prof Brian Ripley ripley at stats.ox.ac.uk
Tue Jul 24 17:58:28 CEST 2001

On Tue, 24 Jul 2001, Micheall Taylor wrote:

> I've seen several posts over the past 2-3 weeks about memory issues.  I've
> tried to carefully follow the suggestions, but remain baffled as to why I
> can't load data into R.  I hope that in revisiting this issue that I don't
> exasperate the list.
> The setting:
> 1 gig RAM, Linux machine
> 10 Stata files of approximately 14megs each
> File contents appear at the end of this boorishly long email.
> Purpose:
> load and combine in R for further analysis
> Question:
> 1) I've placed memory queries in the command file to see what is going on.
> It appears that loading a 14meg file consumes approx 5 times this amount of
> memory - i.e. available memory declines by 70megs when a 14 meg dataset is
> loaded. (Seen in Method 2 below)

That's quite possible.  A `14Mb dataset' is not too helpful to us.  You
seem to have one character variable (ca 2 chars) and 9 numeric variables
per record.  That's ca 75 bytes per record (9 x 8 bytes for the numerics
plus the characters).  An actual experiment using object.size gives 88
bytes per record (there are row names too).  So 70Mb is about 0.8M rows.
If that's not right, the data are not being read in correctly.
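
For anyone wanting to repeat that sort of experiment, here is a rough
sketch.  The column layout (one short character variable plus 9 numerics)
is only my guess at the structure of the files, and the names n, df,
bytes_per_row and est_rows are made up for illustration.  The exact
per-record figure depends on the R version and on how row names and
strings are stored, so do not expect to see exactly 88.

```r
# Hypothetical data frame: one short character column and 9 numeric columns.
n <- 10000
df <- data.frame(id = rep(c("ab", "cd"), length.out = n),
                 matrix(rnorm(9 * n), nrow = n, ncol = 9),
                 stringsAsFactors = FALSE)

# Approximate storage per record, in bytes.
bytes_per_row <- as.numeric(object.size(df)) / n

# Number of rows implied by 70Mb at the 88 bytes per record quoted above.
est_rows <- 70e6 / 88

print(bytes_per_row)
print(est_rows)
```

Comparing nrow() of the object actually read in with an estimate like
est_rows is a quick check that the file was read correctly.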

The main problem I see is that your machine seems unable to allocate more
than about 450Mb to R, and it has surprisingly little swap space.  (This
512Mb Linux machine has 1Gb of swap allocated, and happily allocates 800Mb
to R when needed.)

> 2) Ultimately I would like to replace Stata with R, but the Stata datasets
> I frequently use are in the 100s of megs, which work fine on this machine.
> Is R capable of this?

Probably not.  R does require objects to be stored in memory.

As a serious statistical question: what can you usefully do with 8M rows
on 9 continuous variables?  Why would a 1% sample not be already far more
than enough?  My group regularly works with datasets in the 100s of Mb,
but normally we either sample or we summarize in groups for further
analysis.  Our latest dataset is a 1.2Gb Oracle table, but it has
structure (it's 60 experiments for a start).
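
As a rough illustration of what sampling and summarizing in groups might
look like (simulated data only; the names dat, experiment, x, samp and
group_means are invented for the example and have nothing to do with our
actual table):

```r
set.seed(1)

# Simulated stand-in for a large table with a grouping variable.
n <- 100000
dat <- data.frame(experiment = sample(1:60, n, replace = TRUE),
                  x = rnorm(n))

# A 1% simple random sample of the rows.
samp <- dat[sample(nrow(dat), size = nrow(dat) %/% 100), ]

# One summary row per group, here the mean of x within each experiment.
group_means <- aggregate(x ~ experiment, data = dat, FUN = mean)

print(nrow(samp))
print(head(group_means))
```

Either samp or group_means is far smaller than the full table and can be
carried forward for the real analysis.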


BTW, rbind is inefficient, but adding a piece at a time is the least
efficient way to use it.  rbind(full1, full2, ..., full10) would be
better.  Allocating full and assigning to sub-sections would be better
still.
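
A minimal sketch of the three approaches, using small simulated pieces in
place of the real files (pieces, full_a, full_b, full_c and the column
names are all hypothetical):

```r
# Ten small simulated pieces standing in for full1, ..., full10.
pieces <- lapply(1:10, function(i)
  data.frame(id = rep(as.numeric(i), 100), x = rnorm(100)))

# (a) Least efficient: grow the result one piece at a time.
full_a <- pieces[[1]]
for (p in pieces[-1]) full_a <- rbind(full_a, p)

# (b) Better: a single rbind call over all the pieces.
full_b <- do.call(rbind, pieces)

# (c) Allocate the full data frame once, then assign into sub-sections.
sizes <- vapply(pieces, nrow, integer(1))
full_c <- data.frame(id = numeric(sum(sizes)), x = numeric(sum(sizes)))
end <- cumsum(sizes)
start <- end - sizes + 1
for (i in seq_along(pieces)) {
  full_c[start[i]:end[i], ] <- pieces[[i]]
}
```

All three give the same rows in the same order; they differ in how many
intermediate copies get made along the way.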

Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272860 (secr)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-help-request at stat.math.ethz.ch
