[R] cor() alternative for huge data set
    Peter Langfelder 
    peter.langfelder at gmail.com
       
    Thu Sep 30 04:05:44 CEST 2010
    
    
  
On Wed, Sep 29, 2010 at 1:27 PM, Jyotasana Gulati <jgulati at ice.mpg.de> wrote:
> Hi,
>
> I have a data set of around 43000 probes (rows) and have to calculate a correlation matrix. When I run the cor function in R, it throws an error message about a RAM shortage, which is to be expected for such a huge number of rows. I cannot find a sensible way to cut down this huge number of entities. Is there an alternative to Pearson correlation, or to other dist() calculation methods (e.g. Euclidean), that can be run on such a huge data set?
> Any help will be appreciated.

Hmm... Are you calculating correlations among the 43000 probes, or among
some number of samples across the 43000 probes? If the former, read
below. If the latter, I'm surprised you are running out of memory.
Running garbage collection (gc()) before the calculation, closing all
other programs, removing all other large objects from the R workspace
etc. may help.
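
To make the two cases concrete, here is a minimal sketch on toy data (the
sizes and object names are made up for illustration). cor() correlates the
columns of its argument, so with probes in rows, as in your description,
cor(x) gives the small sample-by-sample matrix, while cor(t(x)) gives the
large probe-by-probe one.

```r
set.seed(3)
n_probes  <- 1000   # toy size; the real data has about 43000
n_samples <- 12     # hypothetical number of samples

# Probes in rows, samples in columns, as described in the question.
x <- matrix(rnorm(n_probes * n_samples), nrow = n_probes)

sample_cor <- cor(x)      # correlations among samples (columns of x)
probe_cor  <- cor(t(x))   # correlations among probes (rows of x)

dim(sample_cor)   # 12 12
dim(probe_cor)    # 1000 1000
```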

If you really need the 43k times 43k correlation matrix of your 43k
probes, read on.

[Disclosure: this is a shameless plug for the package WGCNA (Weighted
Gene Co-expression Network Analysis, also known as Weighted
Correlation Network Analysis), from the package author, namely me.]

First, since any distance matrix among 43k probes will be just as huge
as the correlation matrix, you will not gain anything by using other
distance methods either.
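
A rough back-of-the-envelope estimate shows the scale of the problem. The
sketch below assumes 8 bytes per double-precision entry and ignores any
temporary copies made during the calculation. A dist object stores only the
lower triangle, so it needs about half the space of the full matrix, which
is still far too much for a typical desktop.

```r
n_probes         <- 43000
bytes_per_double <- 8

# Full square correlation matrix, as returned by cor().
full_matrix_gib <- n_probes^2 * bytes_per_double / 1024^3

# dist() keeps the n * (n - 1) / 2 entries of the lower triangle.
dist_gib <- n_probes * (n_probes - 1) / 2 * bytes_per_double / 1024^3

round(full_matrix_gib, 1)   # roughly 13.8 GiB
round(dist_gib, 1)          # roughly 6.9 GiB
```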

Second, depending on what you want to do with the 43k probes, the
package WGCNA may help you. It has methods for creating correlation
networks among a large number of probes. The idea is to pre-cluster
the probes using what I call projective K-means, function
projectiveKMeans. The pre-clustering will return what we call blocks
of probes (or genes). We assume (this is a big assumption) that
correlations among probes belonging to different blocks can be
neglected. Then we treat each block separately for network
construction (or, in your case, possibly simple calculation of
correlation).
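
For what it's worth, the block-wise idea can be sketched in base R. This is
only an illustration under stated assumptions: the toy data, the object
names and the number of blocks are made up, and kmeans() is used as a simple
stand-in for the pre-clustering step. It is not projectiveKMeans; see the
WGCNA documentation for that function's actual interface.

```r
set.seed(1)
n_samples <- 20
n_probes  <- 300

# Toy expression data: samples in rows, probes in columns,
# so that cor() correlates probes with each other.
expr <- matrix(rnorm(n_samples * n_probes), nrow = n_samples)

# Stand-in pre-clustering (NOT projectiveKMeans): ordinary k-means on the
# standardized probe profiles; 3 blocks is an arbitrary choice.
profiles <- t(scale(expr))   # one row per probe
blocks   <- kmeans(profiles, centers = 3, iter.max = 50)$cluster

# One correlation matrix per block. Correlations between probes in
# different blocks are never computed (the big assumption).
block_index <- split(seq_len(n_probes), blocks)
block_cors  <- lapply(block_index,
                      function(idx) cor(expr[, idx, drop = FALSE]))

# Number of stored entries: block-wise versus the full matrix.
sum(sapply(block_cors, length))
n_probes^2
```

The memory saving comes from storing only the within-block squares, so it
grows as the blocks get smaller, at the price of ignoring more between-block
correlations.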

Although this isn't strictly an R topic but rather a microarray analysis
issue, in my experience it is often useful to filter out probes before
actually calculating and interpreting large correlation matrices. In
conjunction with filtering, it can be advantageous to keep only one
probe per gene (presumably there is more than one probe per gene in
your data set). The filtering criterion varies from analysis to
analysis, but if your data represent intensities, it is often a good
idea to throw away probes whose intensity is always low, because such
signals are mostly noise.
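
As a rough illustration of such filtering, here is a minimal base R sketch.
Everything specific in it is an assumption made up for the example: the toy
intensities, the cutoff of 6, the probe-to-gene map, and the rule of keeping
the probe with the highest mean intensity per gene. Choose the criteria that
suit your own analysis.

```r
set.seed(2)
n_samples <- 10
probes    <- paste0("probe", 1:8)

# Toy log2 intensities: samples in rows, probes in columns.
expr <- matrix(runif(n_samples * length(probes), min = 7, max = 12),
               nrow = n_samples, dimnames = list(NULL, probes))
# Make two probes low in every sample.
expr[, c("probe2", "probe7")] <- runif(2 * n_samples, min = 2, max = 5)

# Hypothetical probe-to-gene annotation.
gene <- setNames(c("A", "A", "B", "B", "B", "C", "C", "D"), probes)

# Step 1: drop probes whose intensity never reaches the cutoff.
min_intensity <- 6   # hypothetical cutoff
keep   <- apply(expr, 2, max) >= min_intensity
expr_f <- expr[, keep, drop = FALSE]
gene_f <- gene[keep]

# Step 2: per gene, keep the remaining probe with the highest mean.
mean_int <- colMeans(expr_f)
best <- vapply(split(names(mean_int), gene_f),
               function(p) p[which.max(mean_int[p])],
               character(1))
expr_one <- expr_f[, best, drop = FALSE]

dim(expr_one)   # 10 4: one probe for each of the genes A, B, C and D
```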

If you decide to check out WGCNA, look at
http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork/Rpackages/WGCNA/.

Peter