[R] Principal components analysis on a large dataset

Fri Aug 21 08:55:36 CEST 2009

Moshe Olshansky <m_olshansky <at> yahoo.com> writes:

> 
> Hi Misha,
> 
> Since PCA is a linear procedure and you have only 6000 observations, you do
> not need 68000 variables. Using any 6000 of your variables so that the
> resulting 6000x6000 matrix is non-singular will do. You can choose these
> 6000 variables (columns) randomly, hoping that the resulting matrix is
> non-singular (and checking for this). Alternatively, you can try something
> like choosing one "nice" column, then choosing the second one which is the
> most nearly orthogonal to the first one (kind of Gram-Schmidt), then
> choosing the third one which is most nearly orthogonal to the first two,
> etc. (I am not sure how much roundoff may be a problem; try doing this
> using higher precision if you can). Note that you do not need to load the
> entire 6000x68000 matrix into memory (you can load several thousand
> columns, process them and discard them).
> Anyway, you will end up with a 6000x6000 matrix, i.e. 36,000,000 entries,
> which can fit into memory and you can perform the usual PCA on this matrix.
> 
> Good luck!
> 
> Moshe.
> 
> P.S. I am curious to see what other people think.
> 
I think this will give you *a* principal component analysis, but it won't give
you *the* principal component analysis, in the sense that the first principal
component would account for a certain proportion of the total variance etc. If
you try this, you will see that each random sample of columns has different
eigenvalues, different proportions of variance per eigenvalue, and a different
sum of all eigenvalues, just as you would expect for different data sets.
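
A small simulation illustrates the point. This is only a sketch under
assumptions of my own: the data are random normal numbers, the matrix is a toy
60 x 680 stand-in for the real 6000 x 68000 one, and the seed and object names
are arbitrary. It compares the PCA on all columns with the PCA on two random
subsets of 60 columns.

```r
set.seed(1)                      # arbitrary seed, for reproducibility only
n <- 60; p <- 680                # toy stand-in for 6000 x 68000
X <- matrix(rnorm(n * p), n, p)  # simulated data, not the poster's

# Eigenvalues (component variances) of the PCA on all columns
ev_full <- prcomp(X)$sdev^2

# PCA on two different random subsets of n columns each
ev_a <- prcomp(X[, sample(p, n)])$sdev^2
ev_b <- prcomp(X[, sample(p, n)])$sdev^2

# Total variance and share of the first component for each analysis
summary_tab <- rbind(
  full     = c(total = sum(ev_full), pc1_share = ev_full[1] / sum(ev_full)),
  subset_a = c(total = sum(ev_a),    pc1_share = ev_a[1] / sum(ev_a)),
  subset_b = c(total = sum(ev_b),    pc1_share = ev_b[1] / sum(ev_b))
)
print(summary_tab)
```

The sum of the eigenvalues equals the sum of the variances of the columns that
went into the analysis, so a subset of columns cannot reproduce the total
variance of the full data, and two subsets will in general not agree with each
other either.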

I even failed to create the raw data matrix of dimensions 68000 x 6000 (Error:
cannot allocate vector of size 3.0 Gb).
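
The size in that error message can be checked with a little arithmetic. This
assumes the matrix is stored as ordinary double-precision numbers (8 bytes
each) and that R reports sizes in units of 2^30 bytes; the variable names are
only for illustration.

```r
rows <- 68000; cols <- 6000
bytes_per_double <- 8            # R stores numeric values as 8-byte doubles

# Full data matrix: 408,000,000 entries
size_gib <- rows * cols * bytes_per_double / 2^30
round(size_gib, 2)               # about 3.04, in line with the 3.0 Gb reported

# The 6000 x 6000 matrix from the quoted suggestion, for comparison
small_gib <- 6000 * 6000 * bytes_per_double / 2^30
round(small_gib, 2)              # about 0.27
```

So the full matrix needs roughly eleven times the memory of the 6000 x 6000
one, before counting any copies made during the analysis.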

Cheers, Jari Oksanen