[R] [correction] Animal Morphology: Deriving Classification Equation with

Sun May 24 22:49:53 CEST 2009

[Apologies -- I made an error (see at [***] near the end)]

On 24-May-09 19:07:46, Ted Harding wrote:
> [Your data and output listings removed. For comments, see at end]
> 
> On 24-May-09 13:01:26, cdm wrote:
>> Fellow R Users:
>> I'm not extremely familiar with lda or R programming, but a recent
>> editorial review of a manuscript submission has prompted a crash
>> course. I am on this forum hoping I could solicit some much needed
>> advice for deriving a classification equation.
>> 
>> I have used three basic measurements in lda to predict two groups:
>> male and female. I have a working model, low Wilks' lambda, graphs,
>> coefficients, eigenvalues, etc. (see below). I adjusted the sample
>> analysis for Fisher's or Anderson's Iris data provided in the MASS
>> library for my own data.
>> 
>> My final step is simply to form the classification equation.
>> The classification equation is simply using standardized coefficients
>> to classify each group- in this case male or female. A more thorough
>> explanation is provided:
>> 
>> "For cases with an equal sample size for each group the classification
>> function coefficient (Cj) is expressed by the following equation:
>> 
>> Cj = cj0+ cj1x1+ cj2x2+...+ cjpxp
>> 
>> where Cj is the score for the jth group, j = 1 … k, cj0 is the
>> constant for the jth group, and x = raw scores of each predictor.
>> If W = within-group variance-covariance matrix, and M = column matrix
>> of means for group j, then the constant cj0 = (-1/2)CjMj" (Julia
>> Barfield, John Poulsen, and Aaron French 
>> http://userwww.sfsu.edu/~efc/classes/biol710/discrim/discriminant.htm).
>> 
>> I am unable to navigate this last step based on the R output I have.
>> I only have the linear discriminant coefficients for each predictor
>> that would be needed to complete this equation.
>> 
>> Please, if anybody is familiar with this or able to help, please let me know.
>> There is a spot in the acknowledgments for you.
>> 
>> All the best,
>> Chase Mendenhall
> 
> The first thing I did was to plot your data. This indicates in the
> first place that a perfect discrimination can be obtained on the
> basis of your variables WRMA_WT and WRMA_ID alone (names abbreviated
> to WG, WT, ID, SEX):
> 
>   D0 <- read.csv("horsesLDA.csv")
>   # names(D0) # "WRMA_WG"  "WRMA_WT"  "WRMA_ID"  "WRMA_SEX"
>   WG<-D0$WRMA_WG; WT<-D0$WRMA_WT;
>   ID<-D0$WRMA_ID; SEX<-D0$WRMA_SEX
> 
>   ix.M<-(SEX=="M"); ix.F<-(SEX=="F")
> 
>   ## Plot WT vs ID (M & F)
>   plot(ID,WT,xlim=c(0,12),ylim=c(8,15))
>   points(ID[ix.M],WT[ix.M],pch="+",col="blue")
>   points(ID[ix.F],WT[ix.F],pch="+",col="red")
>   lines(ID,15.5-1.0*(ID))
> 
> and that there is a lot of possible variation in the discriminating
> line WT = 15.5-1.0*(ID)
> 
> Also, it is apparent that the covariance between WT and ID for Females
> is different from the covariance between WT and ID for Males. Hence
> the assumption (of common covariance matrix in the two groups) for
> standard LDA (which you have been applying) does not hold.
> 
> Given that the sexes can be perfectly discriminated within the data
> on the basis of the linear discriminator (WT + ID) (and others),
> the variable WG is in effect a close approximation to noise.
> 
> However, to the extent that there was a common covariance matrix
> to the two groups (in all three variables WG, WT, ID), and this
> was well estimated from the data, then inclusion of the third
> variable WG could yield a slightly improved discriminator in that
> the probability of misclassification (a rare event for such data)
> could be minimised. But it would not make much difference!
> 
> However, since that assumption does not hold, this analysis would
> not be valid.
> 
> If you plot WT vs WG, a common covariance is more plausible; but
> there is considerable overlap for these two variables:
> 
>   plot(WG,WT)
>   points(WG[ix.M],WT[ix.M],pch="+",col="blue")
>   points(WG[ix.F],WT[ix.F],pch="+",col="red")
> 
> If you plot WG vs ID, there is perhaps not much overlap, but a
> considerable difference in covariance between the two groups:
> 
>   plot(ID,WG)
>   points(ID[ix.M],WG[ix.M],pch="+",col="blue")
>   points(ID[ix.F],WG[ix.F],pch="+",col="red")
> 
> This looks better on a log scale, however:
> 
>   lWG <- log(WG) ; lWT <- log(WT) ; lID <- log(ID)
>   ## Plot log(WG) vs log(ID) (M & F)
>   plot(lID,lWG)
>   points(lID[ix.M],lWG[ix.M],pch="+",col="blue")
>   points(lID[ix.F],lWG[ix.F],pch="+",col="red")
> 
> and common covariance still looks good for WG vs WT:
> 
>   ## Plot log(WT) vs log(WG) (M & F)
>   plot(lWG,lWT)
>   points(lWG[ix.M],lWT[ix.M],pch="+",col="blue")
>   points(lWG[ix.F],lWT[ix.F],pch="+",col="red")
> 
> but there is no improvement for WT vs ID:
> 
>   ## Plot log(WT) vs log(ID) (M & F)
>   plot(ID,WT,xlim=c(0,12),ylim=c(8,15))
>   points(ID[ix.M],WT[ix.M],pch="+",col="blue")
>   points(ID[ix.F],WT[ix.F],pch="+",col="red")

[***]
The above is incorrect! Apologies. I plotted the raw WT and ID
instead of their logs. In fact, if you do plot the logs:

  ## Plot log(WT) vs log(ID) (M & F)
  plot(lID,lWT)
  points(lID[ix.M],lWT[ix.M],pch="+",col="blue")
  points(lID[ix.F],lWT[ix.F],pch="+",col="red")

you now get what looks like much closer agreement between the
two groups' covariances cov(lID,lWT) than before. Hence, I would now
suggest that you do your linear discrimination on the logarithms of
the variables (since you also get agreement for the other pairs on
the log scale).
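
As a rough illustration of that suggestion, here is a minimal sketch
of an LDA on the logged variables. It assumes the MASS package is
available, and it uses simulated stand-in data (the data frame `sim`
and all of its numbers are made up for this example, since the real
horsesLDA.csv is not reproduced here):

```r
library(MASS)  # lda() lives in MASS, which ships with standard R installs

set.seed(1)
n <- 50
# Simulated stand-in data (hypothetical, NOT the horsesLDA.csv values);
# generated on the log scale so every measurement is positive.
sim <- data.frame(
  SEX = factor(rep(c("F", "M"), each = n)),
  WG  = exp(c(rnorm(n, log(60), 0.03), rnorm(n, log(63), 0.03))),
  WT  = exp(c(rnorm(n, log(10), 0.05), rnorm(n, log(11), 0.05))),
  ID  = exp(c(rnorm(n, log(3),  0.25), rnorm(n, log(6),  0.25)))
)

# Linear discriminant analysis on the logged variables
fit  <- lda(SEX ~ log(WG) + log(WT) + log(ID), data = sim)
pred <- predict(fit, newdata = sim)$class

print(fit$scaling)                                  # LD1 coefficients
tab <- table(observed = sim$SEX, predicted = pred)  # resubstitution table
print(tab)
```

With the real data one would replace `sim` by the data frame read from
the CSV file. The table above is a resubstitution table and so tends
to be optimistic; lda() also accepts CV=TRUE for leave-one-out
classifications.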

In fact:

[Raw]:
  [Male]:
  cov(cbind(WG,WT,ID)[ix.M,])
  #            WG         WT          ID
  # WG  2.2552465 0.11074710 -0.02202080
  # WT  0.1107471 0.33853450  0.06601287
  # ID -0.0220208 0.06601287  0.31979368

  [Female]:
  cov(cbind(WG,WT,ID)[ix.F,])
  #           WG        WT        ID
  # WG 2.4716912 0.1577307 0.6670657
  # WT 0.1577307 0.3183928 0.2973335
  # ID 0.6670657 0.2973335 2.8326520

[log]:
  [Male]:
  cov(cbind(lWG,lWT,lID)[ix.M,])
  #               lWG          lWT           lID
  # lWG  0.0006584465 0.0001813315 -0.0002133576
  # lWT  0.0001813315 0.0030368382  0.0030442356
  # lID -0.0002133576 0.0030442356  0.0693965979

  [Female]:
  cov(cbind(lWG,lWT,lID)[ix.F,])
  #              lWG          lWT         lID
  # lWG  0.0007244826 0.0002171885  0.001951343
  # lWT  0.0002171885 0.0019640076  0.003305884
  # lID  0.0019513428 0.0033058841  0.068406840
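
If the log-scale covariances are accepted as roughly equal, they can
be pooled into a single within-group matrix W, and the classification
functions asked about in the original question can then be formed
from W and the group means. The sketch below assumes the usual
equal-prior form, with coefficient vector c_j = W^{-1} M_j and
constant c_j0 = -(1/2) c_j' M_j. The data are simulated and every
object name (X, g, cf, c0, cls.fun) is made up for this example:

```r
set.seed(2)
n <- 40
# Simulated log-measurements (hypothetical; not the poster's data)
X <- rbind(
  cbind(lWT = rnorm(n, 2.30, 0.05), lID = rnorm(n, 1.1, 0.25)),  # "F"
  cbind(lWT = rnorm(n, 2.40, 0.05), lID = rnorm(n, 1.8, 0.25))   # "M"
)
g <- factor(rep(c("F", "M"), each = n))

# Pooled within-group covariance W (weights n_k - 1, divisor N - k)
covs <- lapply(levels(g), function(k) cov(X[g == k, , drop = FALSE]))
nk   <- as.vector(table(g))
W    <- Reduce("+", Map(function(S, m) (m - 1) * S, covs, nk)) /
        (sum(nk) - length(nk))

# Group mean vectors, one column per group
M <- sapply(levels(g), function(k) colMeans(X[g == k, , drop = FALSE]))

# Classification function coefficients and constants (equal priors)
cf <- solve(W, M)              # column j holds c_j = W^{-1} M_j
c0 <- -0.5 * colSums(cf * M)   # c_j0 = -(1/2) c_j' M_j

cls.fun <- rbind(constant = c0, cf)
print(cls.fun)

# Score every case on each function; assign to the larger score
scores   <- sweep(X %*% cf, 2, c0, "+")
assigned <- factor(levels(g)[max.col(scores)], levels = levels(g))
print(table(observed = g, assigned = assigned))
```

Each column of `cls.fun` is one group's equation (constant first,
then one coefficient per variable). With unequal prior probabilities
one would normally add log(prior_j) to each constant.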

> So there is no simple road to applying a routine LDA to your data.
> 
> To take account of different covariances between the two groups,
> you would normally be looking at a quadratic discriminator. However,
> as indicated above, the fact that a linear discriminator using
> the variables ID & WT alone works so well would leave considerable
> imprecision in conclusions to be drawn from its results.
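
Should the group covariances still be judged too different, the
quadratic discriminator mentioned above could be tried along the
following lines. This is only a sketch on simulated data: it assumes
the MASS package, the two covariance matrices are loosely patterned
on the raw WT/ID values listed earlier, and the group means are
invented for the example:

```r
library(MASS)  # provides lda(), qda() and mvrnorm()

set.seed(3)
n <- 60
# Unequal group covariances for (WT, ID); means are hypothetical
Sf <- matrix(c(0.32, 0.30, 0.30, 2.83), 2)  # "female-like"
Sm <- matrix(c(0.34, 0.07, 0.07, 0.32), 2)  # "male-like"
X  <- rbind(mvrnorm(n, mu = c(10, 3), Sigma = Sf),
            mvrnorm(n, mu = c(12, 7), Sigma = Sm))
colnames(X) <- c("WT", "ID")
dat <- data.frame(X, SEX = factor(rep(c("F", "M"), each = n)))

fit.l <- lda(SEX ~ WT + ID, data = dat)  # assumes a common covariance
fit.q <- qda(SEX ~ WT + ID, data = dat)  # one covariance per group

# Resubstitution error rates for the two rules
err <- c(lda = mean(predict(fit.l, dat)$class != dat$SEX),
         qda = mean(predict(fit.q, dat)$class != dat$SEX))
print(err)
```

When the groups are as well separated as they appear to be here, the
two error rates will usually be close, which is the point made above.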
> 
> Sorry this is not the straightforward answer you were hoping for
> (which I confess I have not sought); it is simply a reaction to
> what your data say.
> 
> Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 24-May-09                                       Time: 21:49:50
------------------------------ XFMail ------------------------------