[R] Using huge datasets
FIVAZ Fabien
Fabien.Fivaz at unine.ch
Wed Feb 4 18:30:05 CET 2004
You were all right. My data, when loaded with scan(), takes only about 300 MB of memory and I have no problem with it at that stage. The output of scan is not yet a matrix, and I can easily convert it to one with matrix(blabla). The problem is that I then have to convert it to a data frame (I have a mix of numbers and factors). That takes some time, but it's OK. However, I cannot read or work with the resulting data frame; it always ends with a seg fault! I *just* did variable[1] (where variable is the name of my variable :-)), and it returned a seg fault.
Why is there such a difference between matrices and data frames? Is it because data frames store much more information?
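(A rough way to see where the extra memory goes is to compare the two representations directly. This is only a sketch; it assumes a purely numeric, comma-separated file "data.csv" with 10 columns, whereas the real data mixes numbers and factors.)

    ## Hypothetical comparison of matrix vs. data frame footprint
    x  <- scan("data.csv", what = 0, sep = ",")   # flat numeric vector
    m  <- matrix(x, ncol = 10, byrow = TRUE)      # same values as a matrix
    df <- as.data.frame(m)                        # conversion copies the data

    object.size(m)    # size of the matrix
    object.size(df)   # data frame adds per-column attributes and row names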
Best wishes, Fabien
-------- Original Message --------
From: Liaw, Andy [mailto:andy_liaw at merck.com]
Date: Wed 04.02.2004 17:11
To: FIVAZ Fabien; r-help at stat.math.ethz.ch
Cc:
Subject: RE: [R] Using huge datasets
A matrix of that size takes up just over 320MB to store in memory. I'd
imagine you probably can do it with 2GB physical RAM (assuming your
`columns' are all numeric variables; i.e., no factors).
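(For reference, that figure follows from simple arithmetic:)

    ## Quick check of the ~320 MB estimate: 4.2 million rows of 10 double
    ## (8-byte) columns
    4.2e6 * 10 * 8          # 336,000,000 bytes
    4.2e6 * 10 * 8 / 2^20   # about 320 MB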
However, perhaps a better way than the brute-force, one-shot approach is to read
in the data in chunks and do the prediction piece by piece. You can use
scan(), or open()/readLines()/close(), to do this fairly easily.
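(A minimal sketch of the chunked approach, assuming an already fitted model
`fit` (glm/gam), a comma-separated file "bigdata.csv" with no header, and 10
numeric columns named as below -- the file name, model object and column
names are all placeholders.)

    cols <- c("y", paste("x", 1:9, sep = ""))
    what <- rep(list(0), 10)          # read every column as numeric
    names(what) <- cols

    con   <- file("bigdata.csv", open = "r")
    preds <- numeric(0)
    repeat {
      ## scan() on an open connection continues where the last call stopped
      chunk <- scan(con, what = what, sep = ",", nlines = 1e5, quiet = TRUE)
      if (length(chunk[[1]]) == 0) break        # end of file
      preds <- c(preds, predict(fit, newdata = as.data.frame(chunk),
                                type = "response"))
    }
    close(con)

(Growing `preds` inside the loop is slow for very many chunks; preallocating a
vector of the right length would be faster, but this keeps the sketch short.)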
My understanding of how (most) clusters work is that you need at least one
node that will accommodate the memory load for the monolithic R process, so
probably not much help. (I could very well be wrong about this. If so, I'd
be very grateful for correction.)
HTH,
Andy
> From: Fabien Fivaz
>
> Hi,
>
> Here is what I want to do. I have a dataset containing 4.2 *million*
> rows and about 10 columns and want to do some statistics with it,
> mainly using it as a prediction set for GAM and GLM models. I tried to
> load it from a csv file but, after filling up memory and part of the
> swap (1 GB each), I get a segmentation fault and R stops. I use R under
> Linux. Here are my questions:
>
> 1) Has anyone ever tried to use such a big dataset?
> 2) Do you think that it would be possible on a more powerful machine,
> such as a cluster of computers?
> 3) Finally, does R have some "memory limitation" or does it just
> depend on the machine I'm using?
>
> Best wishes
>
> Fabien Fivaz
>