[R] Significance of Principal Coordinates

Tue Mar 15 17:38:59 CET 2005

On Mon, 2005-03-14 at 18:32 +0100, Christian Kamenik wrote:
> Dear all,
> 
> I was looking for methods in R that allow assessing the number of
> significant principal coordinates. Unfortunately I was not very
> successful. I expanded my search to the web and Current Contents,
> however, the information I found is very limited.
> Therefore, I tried to write code for doing a randomization. I would
> highly appreciate it if somebody could comment on the following approach. I
> am neither a statistician, nor an R expert... the data matrix I used has 
> 72 species (columns) and 167 samples (rows).
> 
Earlier this year (Sat, 29 Jan 2005) Jérôme Lemaître asked something
similar here under the subject "Bootstrapped eigenvector" (but the code
I posted then had one bug I know of and perhaps some I don't!). Some
ecologists (Donald Jackson, Peres-Neto) have indeed tried to develop
methods for PCA, and these could easily be adapted to PCoA, which is
essentially the same method, in particular with Euclidean distances like
the ones you used. So the following two solutions are practically
identical (within 2e-15 in the case I tried):

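library(vegan) # provides decostand() and procrustes()
x <- decostand(x, "norm") # scale each row (sample) to unit length
chordis <- dist(x) # Euclidean is the default, so this is chord distance
pcoa <- cmdscale(chordis) # PCoA; returns two axes by default
pca <- prcomp(x) # PCA of the same normalized data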

Verify this with:

procrustes(pcoa, pca, choices=1:2) # in vegan
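
If vegan is not at hand, the same equivalence can be sketched in base R
alone. This is only an illustration: the data are simulated (not the
72-species matrix from the question), the row normalization is written
out by hand in place of decostand(), and the name max_diff is my own.
Axes are defined only up to sign, so the comparison uses absolute
values instead of a Procrustes rotation.

```r
set.seed(1)
# Simulated abundances: 20 samples (rows) by 6 species (columns);
# adding 1 guarantees that no row is all zeros
x <- matrix(rpois(120, lambda = 3) + 1, nrow = 20)

# Scale each row to unit length, as decostand(x, "norm") would do
x <- x / sqrt(rowSums(x^2))

# PCoA on chord distances and PCA on the same normalized data
pcoa <- cmdscale(dist(x), k = 2)
pca <- prcomp(x)$x[, 1:2]

# Largest discrepancy between the two solutions, ignoring axis sign
max_diff <- max(abs(abs(pcoa) - abs(pca)))
max_diff
```

The difference should be at the level of rounding error, which is all
that "practically identical" means above.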

PCoA with row weights is something different, but I really don't know
why you would want to do this. I really don't understand what people
mean by "significant" eigenvalues, unless they are doing Factor
Analysis. In PCA, you rotate your data, and you can find low-rank
approximations of your data, but how these rotations could be
"significant" is beyond my imagination. Further, resampling with
replacement seems poorly suited to multivariate analysis: it duplicates
some rows, which makes it easier to find similar rows, and finding
similar rows is the ultimate task in PC rotation. It seems that Monte
Carlo results are systematically "better" than any original data (this
is not disturbing only if the number of rows is much lower than the
number of columns). Also, resampling or shuffling species tends to
create communities that are fundamentally different from any real
community we have: instead of a single or a few abundant species, they
may have several or none. With a total abundance constraint you can hide
the traces of anarchistic community assembly, but not its fundamental
fault. So I do think that (1) you cannot use resampling in assessing PCA
and its kin, (2) you cannot say what is the meaning of being
"significant" in this case, and (3) the number of "significant" axes
would only be a function of sample size even here.
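
Anyone who wants to check the claim about duplicated rows on their own
data could start from something like the following. It is a rough
sketch under my own assumptions, not an established test: the data are
simulated, the helper first_axis_share() is a made-up name, and the
share of variance on the first axis is just one possible summary of how
"good" a solution looks.

```r
set.seed(42)
# Simulated abundances: 30 samples (rows) by 8 species (columns)
x <- matrix(rpois(240, lambda = 3) + 1, nrow = 30)

# Hypothetical helper: proportion of total variance on the first PC axis
first_axis_share <- function(m) {
  ev <- prcomp(m)$sdev^2
  ev[1] / sum(ev)
}

observed <- first_axis_share(x)

# Bootstrap: resample rows with replacement, so some rows are duplicated
boot_share <- replicate(200, {
  idx <- sample(nrow(x), replace = TRUE)
  first_axis_share(x[idx, , drop = FALSE])
})

# Where does the observed value sit within the bootstrap distribution?
summary(boot_share)
mean(boot_share > observed)
```

If the last number is well above one half, the resampled data sets tend
to look more structured than the original one, which is the problem
described above.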

Now my hope is that some guru over there gets so irritated that (s)he
chastises me for writing such pieces of stupidity, and sends a correct
solution here with accompanying code and references to the literature.
Let's hope so.

The old truth is that most data sets have 2.5 dimensions (Kruskal):
those two that you can show in a printed plot, and that half a dimension
that you must explain away in the text. Wouldn't that be a sufficient
solution?

cheers, jari oksanen
-- 
Jari Oksanen <jarioksa at sun3.oulu.fi>