[R] [correction] Animal Morphology: Deriving Classification Equation with
(Ted Harding)
Ted.Harding at manchester.ac.uk
Sun May 24 22:49:53 CEST 2009
[Apologies -- I made an error (see at [***] near the end)]
On 24-May-09 19:07:46, Ted Harding wrote:
> [Your data and output listings removed. For comments, see at end]
>
> On 24-May-09 13:01:26, cdm wrote:
>> Fellow R Users:
>> I'm not extremely familiar with lda or R programming, but a recent
>> editorial review of a manuscript submission has prompted a crash
>> course. I am on this forum hoping I could solicit some much needed
>> advice for deriving a classification equation.
>>
>> I have used three basic measurements in lda to predict two groups:
>> male and female. I have a working model, low Wilk's lambda, graphs,
>> coefficients, eigenvalues, etc. (see below). I adjusted the sample
>> analysis for Fisher's or Anderson's Iris data provided in the MASS
>> library for my own data.
>>
>> My final step is simply to form the classification equation.
>> The classification equation is simply using standardized coefficients
>> to classify each group- in this case male or female. A more thorough
>> explanation is provided:
>>
>> "For cases with an equal sample size for each group the classification
>> function coefficient (Cj) is expressed by the following equation:
>>
>> Cj = cj0+ cj1x1+ cj2x2+...+ cjpxp
>>
>> where Cj is the score for the jth group, j = 1 ... k, cj0 is the
>> constant for the jth group, and x = raw scores of each predictor.
>> If W = within-group variance-covariance matrix, and M = column matrix
>> of means for group j, then the constant cj0 = (-1/2)CjMj" (Julia
>> Barfield, John Poulsen, and Aaron French
>> http://userwww.sfsu.edu/~efc/classes/biol710/discrim/discriminant.htm).
>>
>> I am unable to navigate this last step based on the R output I have.
>> I only have the linear discriminant coefficients for each predictor
>> that would be needed to complete this equation.
>>
>> Please, if anybody is familiar or able to help please let me know.
>> There is a spot in the acknowledgments for you.
>>
>> All the best,
>> Chase Mendenhall
>
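For reference, the classification functions quoted above can be computed directly from the group means and the pooled within-group covariance matrix W. In the usual textbook form the coefficient vector for group j is W^{-1} Mj, and the constant cj0 is -1/2 times the product of that vector with Mj. The following is only a sketch under equal priors: the built-in iris data (two species, three variables) stand in for the real measurements, and all object names are my own.

```r
# Sketch only: iris stands in for the poster's data (an assumption).
d <- subset(iris, Species != "setosa")
d$Species <- droplevels(d$Species)
X <- as.matrix(d[, c("Sepal.Length", "Sepal.Width", "Petal.Length")])
g <- d$Species

# Group means: one row per group, one column per variable
M <- apply(X, 2, function(col) tapply(col, g, mean))

# Pooled within-group covariance matrix W
n <- table(g)
W <- Reduce(`+`, lapply(levels(g), function(k) {
  (n[[k]] - 1) * cov(X[g == k, , drop = FALSE])
})) / (nrow(X) - nlevels(g))

# Classification function coefficients: row j holds (W^{-1} M_j)'
coefs <- M %*% solve(W)
# Constants: c_j0 = -1/2 * (W^{-1} M_j)' M_j
const <- -0.5 * rowSums(coefs * M)

# Scores C_j for every case; assign each case to the group with the larger score
scores <- sweep(X %*% t(coefs), 2, const, `+`)
pred <- factor(levels(g)[max.col(scores, ties.method = "first")],
               levels = levels(g))

print(cbind(constant = const, coefs))
print(table(observed = g, predicted = pred))
```

With unequal group sizes or unequal priors, log(prior) for each group would normally be added to its constant; that term is left out here because the quoted formula assumes equal sample sizes.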
> The first thing I did was to plot your data. This indicates in the
> first place that a perfect discrimination can be obtained on the
> basis of your variables WRMA_WT and WRMA_ID alone (names abbreviated
> to WG, WT, ID, SEX):
>
> D0 <- read.csv("horsesLDA.csv")
> # names(D0) # "WRMA_WG" "WRMA_WT" "WRMA_ID" "WRMA_SEX"
> WG<-D0$WRMA_WG; WT<-D0$WRMA_WT;
> ID<-D0$WRMA_ID; SEX<-D0$WRMA_SEX
>
> ix.M<-(SEX=="M"); ix.F<-(SEX=="F")
>
> ## Plot WT vs ID (M & F)
> plot(ID,WT,xlim=c(0,12),ylim=c(8,15))
> points(ID[ix.M],WT[ix.M],pch="+",col="blue")
> points(ID[ix.F],WT[ix.F],pch="+",col="red")
> lines(ID,15.5-1.0*(ID))
>
> and that there is a lot of possible variation in the discriminating
> line WT = 15.5-1.0*(ID)
>
> Also, it is apparent that the covariance between WT and ID for Females
> is different from the covariance between WT and ID for Males. Hence
> the assumption (of common covariance matrix in the two groups) for
> standard LDA (which you have been applying) does not hold.
>
> Given that the sexes can be perfectly discriminated within the data
> on the basis of the linear discriminator (WT + ID) (and others),
> the variable WG is in effect a close approximation to noise.
>
> However, to the extent that there was a common covariance matrix
> to the two groups (in all three variables WG, WT, ID), and this
> was well estimated from the data, then inclusion of the third
> variable WG could yield a slightly improved discriminator in that
> the probability of misclassification (a rare event for such data)
> could be minimised. But it would not make much difference!
>
> However, since that assumption does not hold, this analysis would
> not be valid.
>
> If you plot WT vs WG, a common covariance is more plausible; but
> there is considerable overlap for these two variables:
>
> plot(WG,WT)
> points(WG[ix.M],WT[ix.M],pch="+",col="blue")
> points(WG[ix.F],WT[ix.F],pch="+",col="red")
>
> If you plot WG vs ID, there is perhaps not much overlap, but a
> considerable difference in covariance between the two groups:
>
> plot(ID,WG)
> points(ID[ix.M],WG[ix.M],pch="+",col="blue")
> points(ID[ix.F],WG[ix.F],pch="+",col="red")
>
> This looks better on a log scale, however:
>
> lWG <- log(WG) ; lWT <- log(WT) ; lID <- log(ID)
> ## Plot log(WG) vs log(ID) (M & F)
> plot(lID,lWG)
> points(lID[ix.M],lWG[ix.M],pch="+",col="blue")
> points(lID[ix.F],lWG[ix.F],pch="+",col="red")
>
> and common covariance still looks good for WG vs WT:
>
> ## Plot log(WT) vs log(WG) (M & F)
> plot(lWG,lWT)
> points(lWG[ix.M],lWT[ix.M],pch="+",col="blue")
> points(lWG[ix.F],lWT[ix.F],pch="+",col="red")
>
> but there is no improvement for WT vs ID:
>
> ## Plot log(WT) vs log(ID) (M & F)
> plot(ID,WT,xlim=c(0,12),ylim=c(8,15))
> points(ID[ix.M],WT[ix.M],pch="+",col="blue")
> points(ID[ix.F],WT[ix.F],pch="+",col="red")
[***]
The above is incorrect! Apologies. I plotted the raw WT and ID
instead of their logs. In fact, if you do plot the logs:
## Plot log(WT) vs log(ID) (M & F)
plot(lID,lWT)
points(lID[ix.M],lWT[ix.M],pch="+",col="blue")
points(lID[ix.F],lWT[ix.F],pch="+",col="red")
you now get what looks like much closer agreement between the two
groups in the covariance cov(lID,lWT) than before. Hence, I would now
suggest that you do your linear discrimination on the logarithms of
the variables (since you also get agreement for the other pairs on
the log scale).
In fact:
[Raw]:
[Male]:
cov(cbind(WG,WT,ID)[ix.M,])
# WG WT ID
# WG 2.2552465 0.11074710 -0.02202080
# WT 0.1107471 0.33853450 0.06601287
# ID -0.0220208 0.06601287 0.31979368
[Female]:
cov(cbind(WG,WT,ID)[ix.F,])
# WG WT ID
# WG 2.4716912 0.1577307 0.6670657
# WT 0.1577307 0.3183928 0.2973335
# ID 0.6670657 0.2973335 2.8326520
[log]:
[Male]:
cov(cbind(lWG,lWT,lID)[ix.M,])
# lWG lWT lID
# lWG 0.0006584465 0.0001813315 -0.0002133576
# lWT 0.0001813315 0.0030368382 0.0030442356
# lID -0.0002133576 0.0030442356 0.0693965979
[Female]:
cov(cbind(lWG,lWT,lID)[ix.F,])
# lWG lWT lID
# lWG 0.0007244826 0.0002171885 0.001951343
# lWT 0.0002171885 0.0019640076 0.003305884
# lID 0.0019513428 0.0033058841 0.068406840
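As a rough illustration of running the linear discriminant analysis on the logarithms, here is a sketch using lda() from MASS. The data are simulated, not the real measurements: the means, the covariance matrix S and the data frame name `sim` are all made up for the example, with the two groups sharing one covariance matrix on the log scale.

```r
library(MASS)  # provides lda() and mvrnorm()

set.seed(1)
# Hypothetical data: both groups share covariance S on the log scale
S  <- matrix(c(0.003, 0.001,
               0.001, 0.060), 2, 2)
lF <- mvrnorm(40, mu = c(2.40, 1.6), Sigma = S)   # made-up female means
lM <- mvrnorm(40, mu = c(2.50, 2.1), Sigma = S)   # made-up male means
sim <- data.frame(exp(rbind(lF, lM)),
                  SEX = factor(rep(c("F", "M"), each = 40)))
names(sim)[1:2] <- c("WT", "ID")

# LDA on the log-transformed variables, with equal priors
fit <- lda(SEX ~ log(WT) + log(ID), data = sim, prior = c(0.5, 0.5))

print(fit$scaling)                                # discriminant coefficients
print(table(observed  = sim$SEX,
            predicted = predict(fit, sim)$class)) # resubstitution table
```

The resubstitution table is optimistic as an error estimate; lda() also accepts CV = TRUE for leave-one-out predictions.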
> So there is no simple road to applying a routine LDA to your data.
>
> To take account of different covariances between the two groups,
> you would normally be looking at a quadratic discriminator. However,
> as indicated above, the fact that a linear discriminator using
> the variables ID & WT alone works so well would leave considerable
> imprecision in conclusions to be drawn from its results.
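For comparison, a quadratic discriminator can be fitted with qda() from MASS, which estimates a separate covariance matrix for each group. Again this is only a sketch on simulated data: the group means are invented, and the two covariance matrices are merely loosely shaped like the raw WT/ID covariances listed above.

```r
library(MASS)  # provides lda(), qda() and mvrnorm()

set.seed(2)
# Hypothetical data: the two groups have clearly different covariances
SF <- matrix(c(0.30, 0.30,
               0.30, 2.80), 2, 2)
SM <- matrix(c(0.34, 0.07,
               0.07, 0.32), 2, 2)
xF <- mvrnorm(40, mu = c(10.5, 4.0), Sigma = SF)  # made-up means
xM <- mvrnorm(40, mu = c(12.0, 7.0), Sigma = SM)  # made-up means
sim <- data.frame(rbind(xF, xM),
                  SEX = factor(rep(c("F", "M"), each = 40)))
names(sim)[1:2] <- c("WT", "ID")

fit_l <- lda(SEX ~ WT + ID, data = sim)  # assumes a common covariance
fit_q <- qda(SEX ~ WT + ID, data = sim)  # one covariance per group

# Resubstitution error rates of the two rules on the simulated data
err_l <- mean(predict(fit_l, sim)$class != sim$SEX)
err_q <- mean(predict(fit_q, sim)$class != sim$SEX)
print(c(lda = err_l, qda = err_q))
```

When the groups are already well separated, both rules may classify nearly every case correctly, so the error rates alone say little about which model is more appropriate.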
>
> Sorry this is not the straightforward answer you were hoping for
> (which I confess I have not sought); it is simply a reaction to
> what your data say.
>
> Ted.
--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 24-May-09 Time: 21:49:50
------------------------------ XFMail ------------------------------