[R] vsize and nsize
Prof Brian D Ripley
ripley at stats.ox.ac.uk
Tue May 18 19:38:09 CEST 1999
On Tue, 18 May 1999, Thomas Lumley wrote:
> On Tue, 18 May 1999, Jim Lindsey wrote:
>
> > I am wondering what you mean by "R's poor handling of large datasets".
> > How large is large? I have often been working simultaneously with a
> > fair number of vectors of say 40,000 using my libraries (data objects
> > and functions) with no problems. They use the R scoping rules. On the
> > other hand, if you use dataframes and/or standard functions like glm,
> > then you are restricted to extremely small (toy) data sets. But then
> > maybe you are thinking of gigabytes of data.
>
> While I agree that R isn't much use for really big datasets, I am a bit
> surprised by Jim's comments about glm(). I have used R (including glm)
> perfectly satisfactorily on real data sets of ten thousand or so
> observations. This isn't large by today's standards but it isn't in
> any sense toy data.
Me too, and I have done this in S as well (often with a lot less memory
usage than R). I don't believe scoping rules have anything to do with this
(and glm uses R's scoping rules as well: it is hard to use anything else in
R!), but how code is written does. Bill Venables and I had various
discussions at the last S-PLUS users' conference about whether one could
ever justify using regressions, say, with more than 10,000 cases. Points
against include
- such datasets are unlikely to be homogeneous, and are better analysed in
natural strata.
- statistical significance is likely to be reached at practically
insignificant effect sizes.
and the `for' points include
- it may be a legal requirement to use all the data,
- datasets can be very unbalanced, as in 70,000 normals and 68 diseased
(but then one can sub-sample the normals; see the sketch below).
However, that is for statistical analysis, not dataset manipulations.
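As a rough illustration of both points (my own sketch, not code from any of
the posts above; all names and numbers are invented), glm() handles ten
thousand simulated observations routinely, and the unbalanced case can be
handled by sub-sampling the majority class:

    set.seed(1)

    ## glm() on 10,000 observations
    n <- 10000
    x <- rnorm(n)
    y <- rbinom(n, 1, plogis(-1 + 0.5 * x))
    fit <- glm(y ~ x, family = binomial)
    summary(fit)$coefficients

    ## 70,000 normals and 68 diseased: sub-sample the normals.
    ## Sub-sampling the controls only biases the intercept (the
    ## usual case-control result); the slope estimate is unaffected.
    normals  <- data.frame(x = rnorm(70000), disease = 0)
    diseased <- data.frame(x = rnorm(68, mean = 1), disease = 1)
    keep <- sample(nrow(normals), 1000)
    fit2 <- glm(disease ~ x, family = binomial,
                data = rbind(diseased, normals[keep, ]))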
I think I started this by quoting Ross as saying that R is not designed for
large datasets (and neither was S version 3). Large was in the context of a
100Mb heap and 80Mb ncells space, which I think answers Jim's question (go
up a couple of orders of magnitude). Remember that the S developers came
from the Unix `tools' background and said they expected tools such as awk
to be used to manipulate datasets. These days we (and probably they) prefer
more help.
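(For concreteness, a sketch of both workarounds, with the caveat that this
is my illustration rather than anything quoted above. The --vsize/--nsize
startup flags are the options R of this era used to set the two memory
pools; the file name and the awk filter are invented.)

    ## Raise the vector heap and the cons-cell count at startup,
    ## e.g. from the shell:
    ##   R --vsize=100M --nsize=2000000
    ## gc() reports how much of each pool is currently in use:
    gc()

    ## The Unix-tools approach: let awk do the row selection so R
    ## never reads the whole file ("big.dat" and the filter, which
    ## keeps rows whose third field exceeds 100, are hypothetical;
    ## pipe() connections arrived in later versions of R):
    d <- read.table(pipe("awk '$3 > 100' big.dat"))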
--
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272860 (secr)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595