[R] rpart v. lda classification.

Wed Feb 12 00:30:03 CET 2003

I've been groping my way through a classification/discrimination
problem from a consulting client.  There are 26 observations, with 4
possible categories and 24 (!!!) potential predictor variables.

I tried using lda() on the first 7 predictor variables and got 24 of
the 26 observations correctly classified.  (Training and testing both
on the complete data set --- just to get started.)
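
For concreteness, the lda() step might look roughly like the sketch
below.  The data are a synthetic stand-in (pure noise, not the client's
data), and the names dat, d7 and y are made up for illustration; only
the shape (26 observations, 4 categories, 24 predictors) follows the
description above.

```r
library(MASS)

set.seed(1)                      # synthetic stand-in; NOT the client's data
n <- 26; p <- 24
dat <- data.frame(matrix(rnorm(n * p), n, p))      # columns X1..X24
dat$y <- factor(rep(letters[1:4], length.out = n)) # 4 hypothetical categories

# Keep only the first 7 predictors plus the response.
d7 <- dat[, c(paste0("X", 1:7), "y")]

fit.lda <- lda(y ~ ., data = d7)

# Resubstitution: predict the same observations the model was trained on.
pred.lda <- predict(fit.lda, newdata = d7)$class
n.correct.lda <- sum(pred.lda == d7$y)

table(observed = d7$y, predicted = pred.lda)
n.correct.lda
```

Counts obtained this way are optimistic, since every observation is
scored by a rule that has already seen it.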

I then tried rpart() for comparison and was somewhat surprised when
rpart() only managed to classify 14 of the 26 observations correctly.
(I got the same classification using just the first 7 predictors as I
did using all of the predictors.)

I would have thought that rpart(), being unconstrained by a parametric
model, would have a tendency to over-fit and therefore to appear to
do better than lda() when the test data and training data are the
same.
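
One thing that may be worth checking against that expectation is
rpart()'s default stopping rules: rpart.control() has minsplit = 20,
minbucket = round(minsplit/3) = 7 and cp = 0.01.  With only 26
observations, each child of the root has between 7 and 19 observations,
so neither child reaches minsplit and the default tree can have at most
two leaves, i.e. it can predict at most two of the four categories.
The sketch below compares a default tree with one grown under relaxed
controls.  It uses synthetic noise data, not the client's data, and the
names dat, y, count.correct and n.leaves are made up for illustration.

```r
library(rpart)

set.seed(1)                      # synthetic stand-in; NOT the client's data
n <- 26; p <- 24
dat <- data.frame(matrix(rnorm(n * p), n, p))      # columns X1..X24
dat$y <- factor(rep(letters[1:4], length.out = n)) # 4 hypothetical categories

# Default controls: minsplit = 20, minbucket = round(20/3) = 7, cp = 0.01.
fit.default <- rpart(y ~ ., data = dat, method = "class")

# Relaxed controls: allow splits down to single observations, no cp penalty.
fit.full <- rpart(y ~ ., data = dat, method = "class",
                  control = rpart.control(minsplit = 2, minbucket = 1, cp = 0))

# Number of training observations classified correctly (resubstitution).
count.correct <- function(fit)
  sum(predict(fit, newdata = dat, type = "class") == dat$y)

# Number of terminal nodes in a fitted tree.
n.leaves <- function(fit) sum(fit$frame$var == "<leaf>")

leaves  <- c(default = n.leaves(fit.default), full = n.leaves(fit.full))
correct <- c(default = count.correct(fit.default),
             full    = count.correct(fit.full))
leaves
correct
```

If the relaxed tree classifies far more of the training data correctly
than the default one, the low resubstitution count reflects the
stopping rules rather than anything about trees as such.  The relaxed
settings are only meant to show the over-fitting behaviour, not to be a
sensible model for 26 observations.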

Am I being silly, or is there something weird going on?  I can
give more detail on what I actually did, if anyone is interested.

The data are pretty obviously nothing like Gaussian, so my
gut feeling is that rpart() should be much more appropriate than
lda().  And it does not seem surprising that with so few
observations to train with, the success rate should be low, even
when testing and training on the same data set.  What does
surprise me is that lda() gets such a high success rate.

Should I just put this down as a random occurrence of a
low-probability event?

				cheers,

					Rolf Turner
					rolf at math.unb.ca

P.S.  Using CV=TRUE in lda() I got only 16 of the 26 observations
correctly classified.
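
A minimal sketch of that comparison, again on synthetic noise data
rather than the client's data (the names dat, d7 and y are made up for
illustration).  With CV = TRUE, lda() returns leave-one-out
classifications in the class component instead of a fitted object.

```r
library(MASS)

set.seed(1)                      # synthetic stand-in; NOT the client's data
n <- 26; p <- 24
dat <- data.frame(matrix(rnorm(n * p), n, p))      # columns X1..X24
dat$y <- factor(rep(letters[1:4], length.out = n)) # 4 hypothetical categories
d7 <- dat[, c(paste0("X", 1:7), "y")]              # first 7 predictors only

# Resubstitution: fit on all 26, then classify the same 26.
fit.lda <- lda(y ~ ., data = d7)
n.correct.resub <- sum(predict(fit.lda, newdata = d7)$class == d7$y)

# Leave-one-out: each observation is classified by a rule fitted
# to the other 25.
cv.lda <- lda(y ~ ., data = d7, CV = TRUE)
n.correct.cv <- sum(cv.lda$class == d7$y, na.rm = TRUE)

c(resubstitution = n.correct.resub, leave.one.out = n.correct.cv)
```

The gap between the two counts gives a rough idea of how much of the
resubstitution success rate comes from scoring the rule on its own
training data.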