[R] rpart problem
Prof Brian Ripley
ripley at stats.ox.ac.uk
Mon Sep 6 22:50:46 CEST 2004
I think you are confused about the purpose of rpart, which is prediction.
You want to predict `mysuccess'.
One group has 90% success, so the best prediction is `success'.
The other group has 60% success, so the best prediction is `success'.
So there is no point in splitting into groups. Replace 60% by 30% and the
best prediction for group 2 changes.
If this is not now obvious, please read up on tree-based methods.
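
The argument above can be checked with a little arithmetic. The following is a minimal sketch in base R, assuming exactly 1000 cases per group and exact success proportions (so no simulation noise); the helper names node_errors and split_gain are hypothetical and not part of rpart. It counts the cases misclassified when every case in a node is assigned that node's majority class, with and without the split on group.

```r
# Errors made in a node if every case is predicted to be the majority class.
node_errors <- function(n_success, n_failure) min(n_success, n_failure)

# Reduction in misclassified cases from splitting the root into two groups
# of size n with success proportions p1 and p2 (hypothetical helper).
split_gain <- function(p1, p2, n = 1000) {
  s1 <- round(p1 * n)
  s2 <- round(p2 * n)
  root <- node_errors(s1 + s2, 2 * n - s1 - s2)            # no split
  kids <- node_errors(s1, n - s1) + node_errors(s2, n - s2) # after split
  root - kids
}

gain_90_60 <- split_gain(0.9, 0.6)  # both groups predict 'success'
gain_90_30 <- split_gain(0.9, 0.3)  # group 2 now predicts 'failure'
print(c(gain_90_60 = gain_90_60, gain_90_30 = gain_90_30))
```

With 90% and 60% the split removes no misclassifications, however significant the difference in proportions; with 90% and 30% it does, which is why a split appears in that case.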
On Mon, 6 Sep 2004 pfm401 at lineone.net wrote:
> Dear all,
>
> I am having some trouble with getting the rpart function to work as expected.
> I am trying to use rpart to combine levels of a factor to reduce the number
> of levels of that factor. In exploring the code I have noticed that it is
> possible for chisq.test to return a statistically significant result whilst
> the rpart method returns only the root node (i.e. no split is made). The
> following code recreates the issue using simulated data :
>
>
> # Create a 2 level factor with group 1 probability of success 90% and
> # group 2 60%
> tmp1 <- as.factor((runif (1000) <= 0.9))
> tmp2 <- as.factor((runif (1000) <= 0.5))
Is 0.5 a typo? Your comment above and your text below both say 0.6.
> mysuccess <- as.factor(c(tmp1, tmp2))
> mygroup <- as.factor(c(rep (1,1000), rep (2,1000)))
>
> table (mysuccess, mygroup)
> chisq.test (mysuccess, mygroup)
> # p-value = < 2.2e-16
>
> myrpart <- rpart (mysuccess ~ mygroup)
> myrpart
> # rpart does not provide splits !!
>
>
>
> If I change the parameter in the setting of group 2 to 0.3 from 0.6 rpart
> does return splits, i.e. change the line
>
> tmp2 <- as.factor((runif (1000) <= 0.6))
>
> to
>
> tmp2 <- as.factor((runif (1000) <= 0.3))
>
> rpart does split the nodes, but as the split with 0.6 is highly significant
> I would still have expected a split in this case too.
>
>
> I would appreciate any advice as to whether this is a known feature of rpart,
> whether I need to change the way my data are stored, or set some of the
> control options. I have tested a few of these options with no success.
Trying a negative value of cp (see ?rpart.control) will have an effect.
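
As an illustration of the cp suggestion, here is a hedged sketch assuming the rpart package is installed. It uses exact counts (900/100 and 600/400) rather than runif so the result does not depend on the random seed, and cp = -1 is just one arbitrary negative value, not a recommended setting. The names d, fit_default and fit_neg are chosen for this example only.

```r
library(rpart)

# Deterministic version of the example: group 1 has 90% success, group 2 60%.
d <- data.frame(
  mygroup   = factor(rep(c(1, 2), each = 1000)),
  mysuccess = factor(c(rep(c(TRUE, FALSE), times = c(900, 100)),
                       rep(c(TRUE, FALSE), times = c(600, 400))))
)

# Default control settings (cp = 0.01).
fit_default <- rpart(mysuccess ~ mygroup, data = d)

# Same model with a negative complexity parameter.
fit_neg <- rpart(mysuccess ~ mygroup, data = d,
                 control = rpart.control(cp = -1))

# One row per node in each fitted tree.
print(nrow(fit_default$frame))
print(nrow(fit_neg$frame))
```

Note that a split obtained this way does not change the predicted class in either group; it only separates the two estimated success probabilities, which may be what is wanted when the aim is to combine factor levels rather than to classify.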
--
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595