[R] NMDS with missing data?
David et al.:
I hate to be a pest but ...
First, Bert is correct. I should have said to use prcomp(dat, center=TRUE, scale.=TRUE). That runs the SVD on the standardized variables, which gives the same components (up to sign) as princomp(dat, cor=TRUE).
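A quick way to check that equivalence on data without missing values is sketched below. The built-in USArrests data is only a stand-in for dat (an assumption, not the data under discussion).

```r
# Stand-in data: USArrests ships with R and has no missing values.
dat <- USArrests

# PCA via SVD of the centered, standardized data matrix.
p_svd <- prcomp(dat, center = TRUE, scale. = TRUE)

# PCA via eigen-decomposition of the correlation matrix.
p_cor <- princomp(dat, cor = TRUE)

# Compare the component standard deviations.
all.equal(p_svd$sdev, unname(p_cor$sdev))

# Compare the loadings, ignoring the arbitrary sign of each component.
all.equal(abs(unname(p_svd$rotation)),
          abs(unname(unclass(p_cor$loadings))))
```

The scores from the two functions will not match exactly: signs can flip, and princomp() standardizes with divisor n rather than n - 1, so its scores are rescaled by a constant.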
***You will have to remove the cases with missing variables or impute
the missing variables using one of many options in R. ***
Depending on the number of missing values and the nature of the missingness,
this can be a crucial issue. Omitting all cases with missing entries
makes very strong assumptions about the nature of the missingness and
can lead to highly biased results, which is problematic even for
exploration. The same is true of imputation: you need to do
it properly. Again, how much this matters depends on the number of cases at issue.
So it may be wise for Elizabeth to consult a local statistical expert
and not rely on superficial background from a text and remote advice.
There may be dragons ...
Cheers,
Bert
The principal() function in package psych should be fine and will probably give nearly identical results. It does have the ability to generate a pairwise-deletion correlation matrix, so you could include your cases with missing values. I would set rotate="none", at least initially. Hopefully your text will explain why this is a good idea.
I assume you are looking for interesting patterns in the data rather than trying to test a specific hypothesis. Given that, you should try both (or all three with principal()) and see if there are any interesting differences between them.
Earlier I asked if all your variables are numeric (or dichotomies). If any are categorical (factors), these suggestions may have to be revised.
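To see how much the treatment of missing values matters before committing to one, the pairwise-deletion idea can be sketched in base R. This is only an illustration: the airquality columns stand in for the real measurements, and it uses cor() and eigen() directly rather than principal(), which as far as I know does the pairwise step internally through its use argument.

```r
# Stand-in data with NAs: four numeric columns of the built-in airquality set.
dat <- airquality[, c("Ozone", "Solar.R", "Wind", "Temp")]

# Correlations computed from all cases available for each pair of variables.
r_pair <- cor(dat, use = "pairwise.complete.obs")

# Correlations from complete cases only (listwise deletion), for comparison.
r_list <- cor(dat, use = "complete.obs")

# Unrotated principal components of each correlation matrix.
e_pair <- eigen(r_pair, symmetric = TRUE)
e_list <- eigen(r_list, symmetric = TRUE)

# Proportion of variance per component under each treatment of missing values.
round(rbind(pairwise = e_pair$values, listwise = e_list$values) / ncol(dat), 3)

# A pairwise matrix is not guaranteed to be positive semi-definite,
# so check for negative eigenvalues before interpreting it.
any(e_pair$values < 0)
```

If the two rows of the table differ noticeably, that is a sign the missing values are not ignorable and Bert's caution applies.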
>
> Just wanted to note that one does **not** use
> "prcomp() on the correlation matrix of the variables."
>
> As ?prcomp says, it uses the svd of the data matrix, which is
> generally preferable.
>
> Cheers,
> Bert
>
>> Hello David,
>>
>> Yes my variables are all numeric....I have a few questions regarding your 2
>> options.
>>
>> Would these still be the best options if missing data was not an issue? I
>> was told that I should be performing NMDS as it has few assumptions on the
>> data distribution but neither of your options use this.
>>
>> If NMDS is not preferred and I were to perform a PCA, can you tell me why
>> you chose prcomp()? My statistical text (Discovering Statistics Using R)
>> explains PCA quite well using principal() in the psych package so I am just
>> wondering the advantages of one over the other... I am overwhelmed by the
>> number of ordination methods!
>>
>> Thank you,
>> Elizabeth
>>
>>
>>> First. Do not use html messages. They are converted to plain text and your
>>> table ends up a mess. See below. It appears the variables are all numeric?
>>> If so, there are two standard approaches to handling multiple scales and
>>> magnitudes with cluster analysis:
>>>
>>> 1. Use z-scores. The scale() function will convert each variable into a
>>> standard score with a mean of 0 and a standard deviation of 1. Then use
>>> Euclidean distance in the dist() function which will adjust for your
>>> missing values.
>>>
>>> 2. Use prcomp() on the correlation matrix of the variables to extract a set
>>> of principal components and use the principal component scores in the
>>> cluster analysis. This may allow you to reduce the number of variables in
>>> the data set if the 29 variables are correlated with one another.
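Option 1 above can be sketched as follows. The airquality columns, the average-linkage clustering and the choice of three groups are all placeholders (assumptions), not recommendations for the real data.

```r
# Stand-in data with NAs: four numeric columns of the built-in airquality set.
dat <- airquality[, c("Ozone", "Solar.R", "Wind", "Temp")]

# z-scores: scale() computes each column's mean and sd from non-missing values.
z <- scale(dat)

# Euclidean distances: dist() skips missing coordinates for a pair of rows
# and rescales the sum to the number of columns actually used.
d <- dist(z, method = "euclidean")

# hclust() cannot handle NA distances, which arise only when two rows
# share no observed variables, so check first.
anyNA(d)

# Illustrative clustering on the distances (method and k are arbitrary here).
hc <- hclust(d, method = "average")
groups <- cutree(hc, k = 3)
table(groups)
```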
>>>
>>>
>>> Hi David,
>>>
>>> You are right that Bray-Curtis is not suitable for my dataset, and that
>>> my variables are very different. Given your suggestions, I am struggling
>>> with how to transform or standardize my data given that they vary so much.
>>> Additionally, looking at the dist() function I am not sure which distance
>>> measure would be most appropriate. Euclidean seems to be the most widely used but
>>> I'm not sure if it is appropriate for my data (there is much more help for
>>> ecology data than toxicology). Given a sample of my data below (total of
>>> 287 obs. of 29 variables) can you suggest a starting point?
>>>
>>> Thank you!
>>> Elizabeth
>>>
>>> Hi,
>>> I'm trying to run NMDS (non-metric multidimensional scaling) with R vegan
>>> (metaMDS) but I have a few NAs in my data set. I've tried to run it 2 ways.
>>>
>>> The first way with my entire data set which includes variables such as ID,
>>> sex, exposure, treatment, sodium, potassium, chloride....
>>>
>>> mydata.mds<-metaMDS(dat)
>>>
>>> I get the following error:
>>>
>>> Error in if (any(autotransform, noshare > 0, wascores) && any(comm < 0)) { :
>>> missing value where TRUE/FALSE needed
>>> In addition: Warning messages:
>>> 1: In Ops.factor(left, right) : < not meaningful for factors
>>> 2: In Ops.factor(left, right) : < not meaningful for factors
>>> 3: In Ops.factor(left, right) : < not meaningful for factors
>>> 4: In Ops.factor(left, right) : < not meaningful for factors
>>> 5: In Ops.factor(left, right) : < not meaningful for factors
>>>
>>> The second way with only those last biochemical variables (29 in total).
>>>
>>> mydata.mds<-metaMDS(measurements)
>>>
>>> I get this error:
>>>
>>> Error in if (any(autotransform, noshare > 0, wascores) && any(comm < 0)) { :
>>> missing value where TRUE/FALSE needed
>>>
>>> My go-to "na.rm=TRUE" does nothing. Any ideas on how to account for NAs and
>>> if so which of the above options I should be using?
>>> Thanks!
>>> Elizabeth
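Both errors come from data that metaMDS() cannot compare with zero: factor columns in the first call and NAs in the second. A minimal sketch of one way to get past them follows. The data frame is a hypothetical stand-in built from airquality, dropping incomplete cases is only one option (with the risk of bias discussed in this thread), and the Euclidean distance on z-scores is an assumption carried over from the suggestions above.

```r
# Hypothetical stand-in for dat: numeric measurements with NAs plus a factor.
dat <- airquality[, c("Ozone", "Solar.R", "Wind", "Temp")]
dat$group <- factor(ifelse(airquality$Month < 7, "early", "late"))

# Keep numeric columns only; factor columns cause the Ops.factor warnings.
measurements <- dat[, sapply(dat, is.numeric)]

# Drop cases with any NA (listwise deletion).
complete <- measurements[complete.cases(measurements), ]

# Standardize so that variables on different scales contribute equally.
z <- scale(complete)

# Run NMDS on Euclidean distances, but only if vegan is installed.
# The community-data options are switched off because z-scores are
# not abundance data and include negative values.
if (requireNamespace("vegan", quietly = TRUE)) {
  mydata.mds <- vegan::metaMDS(z, distance = "euclidean", k = 2,
                               autotransform = FALSE, noshare = FALSE,
                               wascores = FALSE, trace = FALSE)
}
```

Comparing nrow(complete) with nrow(measurements) shows how many cases the deletion costs, which is worth knowing before trusting the ordination.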
