[R] rpart problem

pfm401@lineone.net pfm401 at lineone.net
Mon Sep 6 11:51:28 CEST 2004

Dear all,

I am having trouble getting the rpart function to work as expected.
I am trying to use rpart to combine levels of a factor, to reduce the number
of levels of that factor. In exploring the code I have noticed that it is
possible for chisq.test to return a statistically significant result whilst
rpart returns only the root node (i.e. no split is made). The
following code recreates the issue using simulated data:

library(rpart)

# Create a 2-level factor: group 1 has a 90% probability of success,
# group 2 has 60%
tmp1  <- as.factor((runif (1000) <= 0.9))
tmp2  <- as.factor((runif (1000) <= 0.6))
mysuccess <- as.factor(c(tmp1, tmp2))
mygroup   <- as.factor(c(rep (1,1000), rep (2,1000)))

table (mysuccess, mygroup)
chisq.test (mysuccess, mygroup)
# p-value < 2.2e-16

myrpart <- rpart (mysuccess ~ mygroup)
# rpart does not provide splits !!
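
To see exactly what rpart returned, the fitted object can be inspected
directly. The following is only a minimal sketch of such a check: the seed,
the variable n and the name n_nodes are additions for illustration and are not
part of the code above, and it assumes the 90% / 60% success probabilities
described in the comment.

```r
library(rpart)

set.seed(1)   # hypothetical seed, used only to make the run repeatable
n <- 1000     # observations per group (illustrative name)

# Same simulated data as above: success with probability 0.9 in group 1
# and 0.6 in group 2
mygroup   <- factor(rep(1:2, each = n))
mysuccess <- factor(c(runif(n) <= 0.9, runif(n) <= 0.6))

# Proportion of failures/successes within each group (columns sum to 1)
prop.table(table(mysuccess, mygroup), margin = 2)

myrpart <- rpart(mysuccess ~ mygroup)

# Text listing of the fitted tree, one line per node
print(myrpart)

# The frame component has one row per node, so a value of 1 here
# means that only the root node is present
n_nodes <- nrow(myrpart$frame)
n_nodes

# Complexity-parameter table recorded during fitting
myrpart$cptable
```

The table of within-group proportions shows which outcome is the majority in
each group, which can be compared with the predicted class that print()
reports for each node.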

If I change the parameter for group 2 from 0.6 to 0.3, i.e. change the line

tmp2  <- as.factor((runif (1000) <= 0.6))

to

tmp2  <- as.factor((runif (1000) <= 0.3))

then rpart does split the node. However, as the difference between the groups
with 0.6 is highly significant, I would have expected a split in that case too.

I would appreciate any advice as to whether this is a known feature of rpart,
whether I need to change the way my data are stored, or set some of the
control options. I have tested a few of these options with no success.


