[R] questions on rpart (tree changes when rearrange the order of covariates?!)

Liaw, Andy andy_liaw at merck.com
Wed May 13 14:03:12 CEST 2009


From: Uwe Ligges
> 
> Yuanyuan wrote:
> > Greetings,
> > 
> > I am using rpart for classification with the "class" method. The test
> > data is the Pima Indians diabetes data from package mlbench.
> > 
> > I first fitted a classification tree using the original data, and then
> > exchanged the order of body mass and plasma glucose, which are the
> > strongest/most important variables in the growing phase. The second
> > tree is a little different from the first one. The misclassification
> > tables are different too. I did not change the data, so why are the
> > results so different?
> 
> Well, at some splits, the variable that comes first and yields the
> same reduction of the impurity criterion as another one might be used,
> hence a different result.
> 
> Uwe Ligges
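
To see the tie-breaking described above in isolation, here is a minimal sketch (not from the original thread). It assumes only the rpart package; the toy data and the names `a`, `b`, and `root_var` are made up for illustration. The two predictors are exact copies of each other, so every split on one ties with the same split on the other, and column order is the only thing left to decide between them.

```r
library(rpart)

set.seed(1)
x <- runif(100)
y <- factor(ifelse(x > 0.5, "pos", "neg"))

# Two identical predictors: any split on `a` ties exactly with the same split on `b`.
d_ab <- data.frame(y = y, a = x, b = x)
d_ba <- d_ab[, c("y", "b", "a")]   # same data, predictor columns swapped

# Hypothetical helper: name of the variable used at the root split.
root_var <- function(fit) as.character(fit$frame$var[1])

fit_ab <- rpart(y ~ ., data = d_ab, method = "class")
fit_ba <- rpart(y ~ ., data = d_ba, method = "class")

root_var(fit_ab)   # variable chosen when `a` comes first
root_var(fit_ba)   # variable chosen when `b` comes first
```

If ties go to the first variable, the two calls should report different root variables even though the data are identical.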

I recently tried writing adaboost.m1 using rpart, and was surprised that
with a very small training set (say n=10 or 20), I get a large improvement
in test set accuracy if I randomly shuffle the columns in the data at
every adaboost iteration.  (With twonorm data, we're talking about 25%
error without shuffling vs. 19% with it, using an n=2000 test set.)  It
turned out to be the way rpart deals with ties: first come, first win.
Without shuffling the columns, rpart almost never picks any variable
beyond the 10th.  (In twonorm, all variables are equally important, so
one would expect roughly equal selection frequency.)

I've gotten some pointers from Terry Therneau about where in the code to
check.  I may try to implement breaking ties at random (as I've done in
randomForest).  No promises, though...
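
The column-shuffling workaround can be sketched as follows. This is an illustration of the idea only, not the adaboost code mentioned in this message and not a patch to rpart. `fit_shuffled` is a hypothetical helper; it assumes the response is the first variable in the formula and that all other columns of `data` are predictors.

```r
library(rpart)

# Hypothetical helper: permute the predictor columns, then fit as usual.
# With first-come tie-breaking, a random column order gives each tied
# variable a chance to be chosen.
fit_shuffled <- function(formula, data, ...) {
  response   <- all.vars(formula)[1]
  predictors <- setdiff(names(data), response)
  shuffled   <- data[, c(response, sample(predictors)), drop = FALSE]
  rpart(formula, data = shuffled, ...)
}

# Toy data with two identical predictors, so every split is an exact tie.
set.seed(1)
x <- runif(100)
toy <- data.frame(y = factor(ifelse(x > 0.5, "pos", "neg")), a = x, b = x)

# Tally which variable ends up at the root over repeated fits.
roots <- replicate(50, {
  fit <- fit_shuffled(y ~ ., data = toy, method = "class")
  as.character(fit$frame$var[1])
})
table(roots)
```

In a boosting loop, such a helper would be called once per iteration so that the column order changes each time.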

Andy
 
> > Does anyone know how rpart deals with ties?
> > 
> > Here is the code for fitting the two trees.
> > 
> > 
> > library(mlbench)
> > data(PimaIndiansDiabetes2)
> > mydata <- PimaIndiansDiabetes2
> > library(rpart)
> > fit2 <- rpart(diabetes ~ ., data = mydata, method = "class")
> > plot(fit2, uniform = TRUE, main = "CART for original data")
> > text(fit2, use.n = TRUE, cex = 0.6)
> > printcp(fit2)
> > table(predict(fit2, type = "class"), mydata$diabetes)
> > ## misclassification table: rows are fitted class
> > ##       neg pos
> > ##   neg 437  68
> > ##   pos  63 200
> > #Klimt(fit2,mydata)
> > 
> > ## swap columns 2 (glucose) and 6 (mass)
> > pmydata <- data.frame(mydata[, c(1, 6, 3, 4, 5, 2, 7, 8, 9)])
> > fit3 <- rpart(diabetes ~ ., data = pmydata, method = "class")
> > plot(fit3, uniform = TRUE, main = "CART after exchanging mass & glucose")
> > text(fit3, use.n = TRUE, cex = 0.6)
> > printcp(fit3)
> > table(predict(fit3, type = "class"), pmydata$diabetes)
> > ## after exchanging the order of body mass and plasma glucose
> > ##       neg pos
> > ##   neg 436  64
> > ##   pos  64 204
> > #Klimt(fit3,pmydata)
> > 
> > 
> > Thanks,
> > 
> > Yuanyuan Huang
> > 
> > ______________________________________________
> > R-help at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.