[R] R's memory limitation and Hadoop

Tue Sep 16 21:14:05 CEST 2014

> [*] I recall a student fitting a GLM with about 30 predictors to 1.5m
> records: at the time (ca R 2.14) it did not fit in 4GB but did in 8GB.

You can easily run out of memory when a few of the variables are
factors, each with many levels, and the user looks for interactions
between them: the model matrix then gets roughly one column for each
combination of levels.  This can happen by accident if your data was
imported with read.table() and a variable meant to be numeric was read
as a factor (or character).  str(yourData) would tell you about this
problem.
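
A minimal sketch of both points, using made-up toy data.  The column
names, level counts and the csv string are illustrative assumptions and
not from the original question; stringsAsFactors = TRUE is set
explicitly to mimic the read.table() default of R versions before 4.0.

```r
# Hypothetical toy input: 'income' is meant to be numeric, but one
# stray "n/a" entry makes read.table() treat the whole column as text.
csv <- "id,income,region
1,52000,north
2,n/a,south
3,61000,east
4,48000,west"

d <- read.table(text = csv, header = TRUE, sep = ",",
                stringsAsFactors = TRUE)
str(d)  # 'income' is reported as a Factor, not as num

# Repair: go through character so the values, not the internal level
# codes, are converted; the "n/a" entry becomes NA.
d$income_num <- suppressWarnings(as.numeric(as.character(d$income)))

# Interaction of two factors with 20 and 15 levels (assumed sizes).
# expand.grid() yields every combination once, so all levels occur.
g  <- expand.grid(f1 = factor(1:20), f2 = factor(1:15))
mm <- model.matrix(~ f1 * f2, data = g)
ncol(mm)  # 1 + 19 + 14 + 19 * 14 = 300 columns

# Rough size of the same dense model matrix for 1.5 million records,
# at 8 bytes per entry: about 3.6 GB.
bytes_needed <- ncol(mm) * 1.5e6 * 8
bytes_needed / 1e9
```

With a mis-read numeric column the factor can have nearly as many
levels as there are distinct values, so the column count grows far
faster than in this toy case.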

Bill Dunlap
TIBCO Software
wdunlap at tibco.com

On Tue, Sep 16, 2014 at 11:47 AM, Prof Brian Ripley
<ripley at stats.ox.ac.uk> wrote:
> On 16/09/2014 13:56, peter dalgaard wrote:
>>
>> Not sure trolling was intended here.
>>
>> Anyways:
>>
>> Yes, there are ways of working with very large datasets in R, using
>> databases or otherwise. Check the CRAN task views.
>>
>> SAS will for _some_ purposes be able to avoid overflowing RAM by using
>> sequential file access. The biglm package is an example of using similar
>> techniques in R. SAS is not (to my knowledge) able to do this invariably,
>> some procedures may need to load the entire data set into RAM.
>>
>> JMP's data tables are limited by available RAM, just like R's are.
>>
>> R does have somewhat inefficient memory strategies (e.g., model matrices
>> may include multiple columns of binary variables, each using 8 bytes per
>> entry), so may run out of memory sooner than other programs, but it is not
>> like the competition is not RAM-restricted at all.
>
>
> Also 'hundreds of thousands of records' is not really very much: I have seen
> analyses of millions many times[*]: I have analysed a few billion with 0.3TB
> of RAM.
>
> [*] I recall a student fitting a GLM with about 30 predictors to 1.5m
> records: at the time (ca R 2.14) it did not fit in 4GB but did in 8GB.
>
>> - Peter D.
>>>
>>>
>>> On September 16, 2014 4:40:29 AM PDT, Barry King <barry.king at qlx.com>
>>> wrote:
>>>>
>>>> Is there a way to get around R’s memory-bound limitation by interfacing
>>>> with a Hadoop database or should I look at products like SAS or JMP to
>>>> work with data that has hundreds of thousands of records?  Any help is
>>>> appreciated.
>>>>
>>>> --
>>>> __________________________
>>>> *Barry E. King, Ph.D.*
>>>> Analytics Modeler
>>>> Qualex Consulting Services, Inc.
>>>> Barry.King at qlx.com
>>>> O: (317)940-5464
>>>> M: (317)507-0661
>>>> __________________________
>
>
>
> --
> Brian D. Ripley,                  ripley at stats.ox.ac.uk
> Emeritus Professor of Applied Statistics, University of Oxford
> 1 South Parks Road, Oxford OX1 3TG, UK
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.