[R] summary.manova rank deficiency error + data

Thu Aug 14 09:34:20 CEST 2008

Peter Dalgaard wrote:
> Pedro Mardones wrote:
>> Thanks for the reply. The SAS output is attached but seems to me that
>> doesn't correspond to the wihtin-row contrasts as you suggested. By
>> the way, yes the data are highly correlated, in fact each row
>> correspond to the first part of a signal vector. Thanks anyway....
>> PM
>>   
> Agreed. I tried disabling the check that causes R to protest, and then 
> it gives similar DF but not quite the same statistics, quite possibly 
> due to numerical instabilities in one or both systems. (You can easily 
> try yourself, just do anova.mlm <- stats::anova.mlm and edit the qr() 
> call inside.)
>
> > anova(lm(cbind(Y1,Y2,Y3,Y4,Y5)~GROUP, test), test = "Wilks")
> Analysis of Variance Table
>
>            Df    Wilks approx F num Df den Df Pr(>F)   (Intercept)  1 
> 0.002537  1887.24      5     24 <2e-16 ***
> GROUP        2     0.62     1.29     10     48 0.2616   Residuals   
> 28                                          ---
> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>
Actually, a better strategy is to transform the responses so as to 
reduce the correlation. This works without messing with the tolerances

 > anova(lm(cbind(Y1,cbind(Y2,Y3,Y4,Y5)-Y1)~GROUP, test), test = "Wilks")
Analysis of Variance Table

            Df    Wilks approx F num Df den Df Pr(>F)   
(Intercept)  1 0.002537  1887.25      5     24 <2e-16 ***
GROUP        2     0.62     1.29     10     48 0.2616   
Residuals   28                                          
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

You might want to try the same thing in SAS. (Also, check that the data 
are actually the same. With this level of correlation, last-digit 
differences can be influential.)

>
>>
>>                                       The GLM Procedure
>>                               Multivariate Analysis of Variance
>>                                     E = Error SSCP Matrix
>>                    y1                y2                y3
>>   y4                y5
>>  y1      0.0353518799       0.035256904      0.0351327804
>> 0.0349749601      0.0347868018
>>  y2       0.035256904      0.0351627227      0.0350395053
>> 0.0348827098      0.0346956744
>>  y3      0.0351327804      0.0350395053      0.0349173343
>> 0.0347617352      0.0345760232
>>  y4      0.0349749601      0.0348827098      0.0347617352
>> 0.0346075203      0.0344233531
>>  y5      0.0347868018      0.0346956744      0.0345760232
>> 0.0344233531      0.0342409225
>>
>> Partial Correlation Coefficients from the Error SSCP Matrix / Prob > |r|
>>       DF = 28             y1             y2             y3
>> y4             y5
>>       y1            1.000000       0.999992       0.999967
>> 0.999921       0.999852
>>                                      <.0001         <.0001
>> <.0001         <.0001
>>       y2            0.999992       1.000000       0.999991
>> 0.999963       0.999911
>>                       <.0001                        <.0001
>> <.0001         <.0001
>>       y3            0.999967       0.999991       1.000000
>> 0.999990       0.999958
>>                       <.0001         <.0001
>> <.0001         <.0001
>>       y4            0.999921       0.999963       0.999990
>> 1.000000       0.999989
>>                       <.0001         <.0001         <.0001
>>            <.0001
>>       y5            0.999852       0.999911       0.999958
>> 0.999989       1.000000
>>                       <.0001         <.0001         <.0001         
>> <.0001
>>
>> The SAS System     10:33 Wednesday, August 13, 2008   8
>>                                       The GLM Procedure
>>                               Multivariate Analysis of Variance
>>                               H = Type III SSCP Matrix for group
>>                    y1                y2                y3
>>   y4                y5
>>  y1      0.0023822408       0.002365848      0.0023471328
>> 0.0023261249      0.0023030993
>>  y2       0.002365848      0.0023495679      0.0023309816
>> 0.0023101183      0.0022872511
>>  y3      0.0023471328      0.0023309816      0.0023125426
>> 0.0022918453      0.0022691608
>>  y4      0.0023261249      0.0023101183      0.0022918453
>> 0.0022713359      0.0022488593
>>  y5      0.0023030993      0.0022872511      0.0022691608
>> 0.0022488593      0.0022266141
>>
>>                  Characteristic Roots and Vectors of: E Inverse * H, 
>> where
>>                               H = Type III SSCP Matrix for group
>>                                     E = Error SSCP Matrix
>> Characteristic           Characteristic Vector  V'EV=1
>>           Root  Percent            y1            y2            y3
>>       y4            y5
>>     0.41840103    71.72     -7542.628     17131.814      5347.394
>> -31627.317     16700.100
>>     0.16496011    28.28     -4180.854     -4413.446     32096.035
>> -35545.204     12040.697
>>     0.00000001     0.00    -41004.875    107291.004    -95905.664
>> 32641.189     -3028.470
>>     0.00000000     0.00      -416.226      -111.206       410.721
>>  295.193      -171.953
>>     0.00000000     0.00    -14678.651      5787.997     54718.250
>> -69055.249     23218.580
>>
>>    MANOVA Test Criteria and F Approximations for the Hypothesis of No
>> Overall group Effect
>>                              H = Type III SSCP Matrix for group
>>                                     E = Error SSCP Matrix
>>                                      S=2    M=1    N=11
>>        Statistic                        Value    F Value    Num DF
>> Den DF    Pr > F
>>        Wilks' Lambda               0.60518744       1.37        10
>>    48    0.2227
>>        Pillai's Trace              0.43658228       1.40        10
>>    50    0.2095
>>        Hotelling-Lawley Trace      0.58336114       1.37        10
>> 33.362    0.2385
>>        Roy's Greatest Root         0.41840103       2.09         5
>>    25    0.1000
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On Wed, Aug 13, 2008 at 4:34 AM, Peter Dalgaard
>> <p.dalgaard at biostat.ku.dk> wrote:
>>  
>>> Pedro Mardones wrote:
>>>    
>>>> Dear R-users;
>>>>
>>>> Previously I posted a question about the problem of rank deficiency in
>>>> summary.manova. As somebody suggested, I'm attaching a small part of
>>>> the data set.
>>>>
>>>> #***************************************************
>>>>
>>>> "test" <-
>>>>
>>>> structure(.Data = list(structure(.Data = 
>>>> c(rep(1,3),rep(2,18),rep(3,10)),
>>>> levels = c("1", "2", "3"),
>>>> class = "factor")
>>>>
>>>>
>>>> ,c(0.181829,0.090159,0.115824,0.112804,0.134650,0.249136,0.163144,0.122012,0.157554,0.126283, 
>>>>
>>>>
>>>> 0.105344,0.125125,0.126232,0.084317,0.092836,0.108546,0.159165,0.121620,0.142326,0.122770, 
>>>>
>>>>
>>>> 0.117480,0.153762,0.156551,0.185058,0.161651,0.182331,0.139531,0.188101,0.103196,0.116877,0.113733) 
>>>>
>>>>
>>>>
>>>> ,c(0.181445,0.090254,0.115840,0.112863,0.134610,0.249003,0.163116,0.122135,0.157206,0.126129, 
>>>>
>>>>
>>>> 0.105302,0.124917,0.126243,0.084455,0.092818,0.108458,0.158769,0.121244,0.141981,0.122595, 
>>>>
>>>>
>>>> 0.117556,0.153507,0.156308,0.184644,0.161421,0.181999,0.139376,0.187708,0.103126,0.116615,0.113746) 
>>>>
>>>>
>>>>
>>>> ,c(0.181058,0.090426,0.115926,0.113022,0.134632,0.248845,0.163140,0.122331,0.156871,0.126023, 
>>>>
>>>>
>>>> 0.105335,0.124757,0.126325,0.084690,0.092885,0.108455,0.158386,0.120913,0.141676,0.122492, 
>>>>
>>>>
>>>> 0.117707,0.153293,0.156095,0.184242,0.161214,0.181670,0.139271,0.187318,0.103129,0.116421,0.113826) 
>>>>
>>>>
>>>>
>>>> ,c(0.180692,0.090704,0.116110,0.113319,0.134745,0.248678,0.163256,0.122637,0.156581,0.125998, 
>>>>
>>>>
>>>> 0.105479,0.124686,0.126514,0.085066,0.093088,0.108587,0.158040,0.120674,0.141446,0.122488, 
>>>>
>>>>
>>>> 0.117972,0.153150,0.155954,0.183885,0.161063,0.181383,0.139251,0.186956,0.103232,0.116351,0.114001) 
>>>>
>>>>
>>>>
>>>> ,c(0.180353,0.091088,0.116392,0.113753,0.134965,0.248520,0.163475,0.123046,0.156354,0.126067, 
>>>>
>>>>
>>>> 0.105726,0.124713,0.126821,0.085584,0.093432,0.108858,0.157742,0.120533,0.141309,0.122595, 
>>>>
>>>>
>>>> 0.118340,0.153088,0.155897,0.183582,0.160975,0.181143,0.139314,0.186636,0.103449,0.116415,0.114275) 
>>>>
>>>> )
>>>> ,names = c("GROUP", "Y1", "Y2", "Y3", "Y4","Y5")
>>>> ,row.names = seq(1:31)
>>>> ,class = "data.frame"
>>>> )
>>>>
>>>> summary(manova(cbind(Y1,Y2,Y3,Y4,Y5)~GROUP, test), test = "Wilks")
>>>>
>>>> #Error in summary.manova(manova(cbind(Y1, Y2, Y3, Y4, Y5) ~ GROUP, 
>>>> test),
>>>>  :
>>>>  residuals have rank 3 < 5
>>>>
>>>> #***************************************************
>>>>
>>>> What I don't understand is why SAS returns no errors using PROC GLM
>>>> for the same data set. Is because PROC GLM doesn't take into account
>>>> problems of rank deficiency? So, should I trust manova instead of PROC
>>>> GLM output? I know it can be a touchy question but I would like to
>>>> receive some insights.
>>>> Thanks
>>>> PM
>>>>
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide
>>>> http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>
>>>>       
>>> What you have here is extremely correlated data:
>>>
>>>    
>>>> (V <- estVar(lm(cbind(Y1,Y2,Y3,Y4,Y5)~GROUP, test)))
>>>>       
>>> Y1 Y2 Y3 Y4 Y5
>>> Y1 0.001262567 0.001259177 0.001254746 0.001249106 0.001242385
>>> Y2 0.001259177 0.001255814 0.001251416 0.001245812 0.001239132
>>> Y3 0.001254746 0.001251416 0.001247055 0.001241494 0.001234861
>>> Y4 0.001249106 0.001245812 0.001241494 0.001235983 0.001229405
>>> Y5 0.001242385 0.001239132 0.001234861 0.001229405 0.001222889
>>>    
>>>> eigen(V)
>>>>       
>>> $values
>>> [1] 6.224077e-03 2.313066e-07 3.499837e-10 4.259125e-12 1.334146e-12
>>>
>>> $vectors
>>> [,1] [,2] [,3] [,4] [,5]
>>> [1,] 0.4503756 0.61213579 0.5204920 -0.3485941 0.1732681
>>> [2,] 0.4491807 0.32333236 -0.1873653 0.5929444 -0.5540795
>>> [3,] 0.4476157 0.01442094 -0.5498688 0.1272921 0.6934503
>>> [4,] 0.4456201 -0.31202109 -0.3198606 -0.6557557 -0.4144143
>>> [5,] 0.4432397 -0.65052351 0.5378809 0.2840428 0.1017918
>>>
>>> Notice the more than 9 orders of magnitude between the eigenvalues.
>>>
>>> I think that what is happening is that what SAS calls MANOVA is 
>>> actually
>>> looking at within-row contrasts, which effectively removes the largest
>>> eigenvalue. In R, the equivalent would be
>>>
>>>    
>>>> anova(lm(cbind(Y1,Y2,Y3,Y4,Y5)~GROUP, test), X=~1, test = "Wilks")
>>>>       
>>> Analysis of Variance Table
>>>
>>>
>>> Contrasts orthogonal to
>>> ~1
>>>
>>> Df Wilks approx F num Df den Df Pr(>F)
>>> (Intercept) 1 0.037 164.873 4 25 <2e-16 ***
>>> GROUP 2 0.701 1.215 8 50 0.3098
>>> Residuals 28
>>> ---
>>> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>>>
>>> or (this could be computationally more precice, but in fact it gives 
>>> the
>>> same result)
>>>
>>>    
>>>> anova(lm(cbind(Y2,Y3,Y4,Y5)-Y1~GROUP, test), test = "Wilks")
>>>>       
>>> -- 
>>>  O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
>>>  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
>>> (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
>>> ~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)              FAX: (+45) 
>>> 35327907
>>>
>>>
>>>     
>
>

-- 
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)              FAX: (+45) 35327907