[R] rpart v. lda classification.

Wed Feb 12 09:10:08 CET 2003

On Tue, 11 Feb 2003, Rolf Turner wrote:

> 
> I've been groping my way through a classification/discrimination
> problem, from a consulting client.  There are 26 observations, with 4
> possible categories and 24 (!!!) potential predictor variables.
> 
> I tried using lda() on the first 7 predictor variables and got 24 of
> the 26 observations correctly classified.  (Training and testing both
> on the complete data set --- just to get started.)
> 
> I then tried rpart() for comparison and was somewhat surprised when
> rpart() only managed to classify 14 of the 26 observations correctly.
> (I got the same classification using just the first 7 predictors as I
> did using all of the predictors.)
> 
> I would have thought that rpart(), being unconstrained by a parametric
> model, would have a tendency to over-fit and therefore to appear to
> do better than lda() when the test data and training data are the
> same.
> 
> Am I being silly, or is there something weird going on?  I can
> give more detail on what I actually did, if anyone is interested.

The first.  rpart is seriously constrained by having so few observations,
and its model is much more restricted than that of lda: axis-parallel
splits only.  There is a similar example, with pictures, in MASS (on the
Cushings data).
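
A rough sketch of the comparison, for anyone who wants to try it.  This is
not the poster's data or code: it assumes the Cushings data frame shipped
with the MASS package, the two predictors on a log scale, and loosened
rpart.control() values picked purely for illustration.  By default rpart()
uses minsplit = 20 and minbucket = round(minsplit/3), so on a couple of
dozen observations very few splits are even attempted.

```r
library(MASS)    # lda() and the Cushings data
library(rpart)   # rpart() and rpart.control()

# Keep only the patients whose syndrome type is known (drop type "u").
cush <- subset(Cushings, Type != "u")
cush$Type <- factor(cush$Type)               # drop the now-empty level
cush$logT <- log(cush$Tetrahydrocortisone)   # illustrative choice of scale
cush$logP <- log(cush$Pregnanetriol)

# Linear discriminant analysis on the two log-scale predictors.
fit_lda <- lda(Type ~ logT + logP, data = cush)

# Classification tree with rpart's default control settings.
fit_rp_default <- rpart(Type ~ logT + logP, data = cush, method = "class")

# Same tree with the size constraints relaxed (illustrative values only).
fit_rp_loose <- rpart(Type ~ logT + logP, data = cush, method = "class",
                      control = rpart.control(minsplit = 2, minbucket = 1,
                                              cp = 0))

# Resubstitution counts: training and testing on the same data.
n_correct <- function(pred) sum(pred == cush$Type)
res <- c(lda           = n_correct(predict(fit_lda, cush)$class),
         rpart_default = n_correct(predict(fit_rp_default, cush,
                                           type = "class")),
         rpart_loose   = n_correct(predict(fit_rp_loose, cush,
                                           type = "class")))
print(res)
```

Comparing the default and loosened rpart counts shows how much of the
gap comes from the size constraints rather than from the axis-parallel
splits.  The loosened tree is badly over-fitted, of course, so its
resubstitution count says nothing about performance on new data.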

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595