[R] rpart - classification and regression trees (CART)
Katie N
knishimura at gmail.com
Sat Dec 12 22:14:59 CET 2009
Hi,
I have a question about the rpart() function in R. I used seven continuous
predictor variables in the model, and the variable "TB122" was chosen for
the first split. However, the output shows four variables that improve the
predicted class membership equally (TB122, TB139, TB144, and TB118) - the
relevant output is pasted below.
Node number 1: 268 observations,    complexity param=0.6
  predicted class=0  expected loss=0.3
    class counts:   197    71
   probabilities: 0.735 0.265
  left son=2 (188 obs) right son=3 (80 obs)
  Primary splits:
      TB122 < 80  to the left,  improve=50, (0 missing)
      TB139 < 90  to the left,  improve=50, (0 missing)
      TB144 < 90  to the left,  improve=50, (0 missing)
      TB118 < 90  to the left,  improve=50, (0 missing)
      TB129 < 100 to the left,  improve=40, (0 missing)
I need to know what method R uses to select the best variable for a node.
I have read that the best split is the one giving the greatest improvement
in predictive accuracy, i.e., the maximum homogeneity (minimum impurity) of
the two groups resulting from the split. I have also read that the Gini
index, chi-square, or G-square can be used to evaluate the level of
impurity.
For this function in R:
1) Why exactly did R pick TB122 over the other variables, given that they
all had the same improvement? Was TB122 chosen for the first node because
the groups "TB122 < 80" and "TB122 >= 80" were the most homogeneous (i.e.,
had the least impurity)?
2) If R is using impurity to determine the best splits, which measure (the
Gini index, chi-square, or G-square) does it use?
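To illustrate question 1, here is a small artificial example (entirely
made up, not my real data) where two predictors are identical, so any
split on one improves impurity exactly as much as the same split on the
other, and rpart still has to commit to one of them for the node:

library(rpart)

set.seed(1)
## Toy data: x2 is a copy of x1, so every candidate split ties.
x1  <- runif(100)
x2  <- x1
y   <- factor(as.integer(x1 > 0.5))
toy <- data.frame(y, x1, x2)

fit <- rpart(y ~ x1 + x2, data = toy, method = "class")
summary(fit)  # both variables appear under "Primary splits" with equal
              # improve, yet only one is used for the actual split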
Thanks!
Katie