[R] Can R handle a matrix with 8 billion entries?
Prof Brian Ripley
ripley at stats.ox.ac.uk
Wed Aug 10 07:16:43 CEST 2011
On Wed, 10 Aug 2011, David Winsemius wrote:
>
> On Aug 9, 2011, at 11:38 PM, Chris Howden wrote:
>
>> Hi,
>>
>> I’m trying to do a hierarchical cluster analysis in R with a Big Data set.
>> I’m running into problems using the dist() function.
>>
>> I’ve been looking at a few threads about R’s memory and have read the
>> memory limits section in R help. However I’m no computer expert so I’m
>> hoping I’ve misunderstood something and R can handle my Big Data set,
>> somehow. Although at the moment I think my dataset is simply too big and
>> there is no way around it, but I’d like to be proved wrong!
>>
>> My data set has 90523 rows of data and 24 columns.
>>
>> My understanding is that this means the distance matrix has a min of
>> 90523^2 elements which is 8194413529. Which roughly translates as 8GB of

A bit less than half that: the matrix is symmetric, so only the
n(n-1)/2 entries below the diagonal need to be stored.

>> memory being required (if I assume each entry requires 1 bit).

Hmm, that would be a 0/1 distance: there are simpler methods to
cluster such distances.

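
The counting argument above can be checked directly in R. This is only a
sketch: it assumes each distance is held as an 8-byte double (which is what
dist() returns), and the variable names are illustrative.

```r
n <- 90523                       # rows in the data set

full    <- n^2                   # entries in the full n x n matrix
lower   <- n * (n - 1) / 2       # entries below the diagonal (what dist() keeps)
max_len <- 2^31 - 1              # vector length limit quoted from the help page

full                             # 8194413529
lower                            # 4097161503
lower > max_len                  # TRUE: still too long for a single vector
lower * 8 / 2^30                 # roughly 30.5 GiB if stored as doubles
```

So even the half that needs storing is about twice the quoted vector length
limit, quite apart from the memory it would take.
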
>> I only have 4GB on a 32bit build of windows and R. So there is no
>> way that’s going to work.
>>
>> So then I thought of getting access to a more powerful computer, and maybe
>> using cloud computing.
>>
>> However the R memory limit help mentions “On all builds of R, the maximum
>> length (number of elements) of a vector is 2^31 - 1 ~ 2*10^9”. Now as the
>> distance matrix I require has more elements than this does this mean it’s
>> too big for R no matter what I do?
>
> Yes. Vector indexing is done with 4 byte integers.

That is only so if you need the full distance matrix at one time, which
you do not for hierarchical clustering (itself a highly dubious method
for more than a few hundred points).
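
As a rough illustration of working without the full matrix, distances can be
computed one block of rows at a time, so that only block x n values are in
memory at once. The sketch below finds each point's nearest neighbour under
Euclidean distance. It is not a clustering method in itself, and
nearest_neighbour() and its block argument are hypothetical names made up
for this example, not functions from R or any package.

```r
# Hypothetical helper: nearest neighbour of every row of x, computed in
# blocks so that the full n x n distance matrix is never formed.
nearest_neighbour <- function(x, block = 100L) {
  x  <- as.matrix(x)
  n  <- nrow(x)
  sq <- rowSums(x^2)                        # squared norms, reused per block
  nn_index <- integer(n)
  nn_dist  <- numeric(n)
  for (start in seq(1L, n, by = block)) {
    idx <- start:min(start + block - 1L, n)
    # squared distances from this block to all rows: |a|^2 + |b|^2 - 2 a.b
    d2 <- outer(sq[idx], sq, "+") -
      2 * tcrossprod(x[idx, , drop = FALSE], x)
    self <- cbind(seq_along(idx), idx)
    d2[self] <- Inf                         # ignore distance of a point to itself
    best <- apply(d2, 1L, which.min)        # closest other row for each block row
    nn_index[idx] <- best
    # rounding can leave tiny negative values, so clamp at zero before sqrt
    nn_dist[idx] <- sqrt(pmax(d2[cbind(seq_along(idx), best)], 0))
  }
  list(index = nn_index, dist = nn_dist)
}

# Small check on four points in the plane
x <- matrix(c(0, 0,  1, 0,  10, 0,  10, 0.5), ncol = 2, byrow = TRUE)
res <- nearest_neighbour(x, block = 2L)
res$index
res$dist
```

With 90523 rows and block = 100, each intermediate matrix is about 72 MB of
doubles, and a few such temporaries exist at once, so the block size would
need to be chosen to suit the memory available.
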
>
> --
>
> David Winsemius, MD
> West Hartford, CT
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
--
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595