[R] Question about rpart decision trees (being used to predict customer churn)
Terry Therneau
therneau at mayo.edu
Mon Jul 27 14:42:29 CEST 2009
-- begin included message ---
Hi,
I am using rpart decision trees to analyze customer churn. I am finding that
the decision trees created are not effective because they are not able to
recognize factors that influence churn. I have created an example situation
below. What do I need to do for rpart to build a tree with the variable
experience? My guess is that this would happen if rpart used the loss matrix
while creating the tree.
> experience <- as.factor(c(rep("good",90), rep("bad",10)))
> cancel <- as.factor(c(rep("no",85), rep("yes",5), rep("no",5),
rep("yes",5)))
> table(experience, cancel)
          cancel
experience no yes
      bad   5   5
      good 85   5
> rpart(cancel ~ experience)
n= 100

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 100 10 no (0.9000000 0.1000000) *
I tried the following commands with no success.
rpart(cancel ~ experience, control=rpart.control(cp=.0001))
rpart(cancel ~ experience, parms=list(split='information'))
rpart(cancel ~ experience, parms=list(split='information'),
control=rpart.control(cp=.0001))
rpart(cancel ~ experience, parms=list(loss=matrix(c(0,1,10000,0), nrow=2,
ncol=2)))
--- end inclusion --------
The program works fine with rpart(as.numeric(cancel) ~ experience), which does
a fit to try and predict the probability of cancellation rather than a YES/NO
decision for each node. I usually find this more informative, particularly for
early analysis. Breiman et al. in the original CART book refer to this as odds
regression. In this analysis, if a split leads to one child with 30% cancellation
and another with 5% cancellation, the split is successful. When using a factor as
the y variable, this split is scored as useless, since the parent and both
children are scored as "NO".
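
As a rough sketch of the regression-style fit, the code below rebuilds the data from the question, tabulates the cancellation rate in each experience group, and then fits the numeric version of the response. It assumes the rpart package is installed. The names rates and fit are only illustrative, and because as.numeric(cancel) codes the factor as no = 1 and yes = 2, each node mean should be read as 1 plus the cancellation rate in that node.

```r
library(rpart)  # assumed to be installed; it is a recommended package in standard R

# Data from the question: 90 "good" and 10 "bad" experiences
experience <- as.factor(c(rep("good", 90), rep("bad", 10)))
cancel <- as.factor(c(rep("no", 85), rep("yes", 5), rep("no", 5), rep("yes", 5)))

# Cancellation rate within each level of experience
rates <- tapply(cancel == "yes", experience, mean)
print(rates)

# Regression-style fit on the numeric codes (no = 1, yes = 2),
# so each node mean is 1 + the cancellation rate in that node
fit <- rpart(as.numeric(cancel) ~ experience)
print(fit)
```

The two groups have quite different cancellation rates (5 of 10 versus 5 of 90), which is the kind of difference a fit on the numeric response can pick up even though both groups would get the same label in a plain classification with equal losses.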
By adjusting the losses to be just right you can get your data to split.
You need to make them such that 85/5 is predicted as 'no cancel' and 5/5 as 'yes
cancel'; 1:2 losses would suffice. In the example where you set losses to
1:10000 both nodes are scored as a 'yes'.
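
The arithmetic behind the two loss ratios can be checked by hand in base R, without rpart. In the sketch below, node_label is a hypothetical helper written for this illustration (it is not an rpart function), and cost_false_yes and cost_false_no are illustrative names for the cost of calling a non-canceller "yes" and the cost of calling a canceller "no". How these costs map onto the entries of the matrix passed as parms=list(loss=...) is not shown here and should be checked against the rpart documentation.

```r
# Class counts in each node, taken from the table in the question
nodes <- list(root = c(no = 90, yes = 10),
              good = c(no = 85, yes = 5),
              bad  = c(no = 5,  yes = 5))

# Hypothetical helper: label a node with the class that has the smaller total loss.
#   cost_false_yes: cost of labelling a non-canceller as "yes"
#   cost_false_no:  cost of labelling a canceller as "no"
node_label <- function(counts, cost_false_yes, cost_false_no) {
  loss <- c(no  = counts[["yes"]] * cost_false_no,   # loss if the node is labelled "no"
            yes = counts[["no"]]  * cost_false_yes)  # loss if the node is labelled "yes"
  names(loss)[which.min(loss)]
}

# Losses in ratio 1:2 versus 1:10000
labels_1_2     <- sapply(nodes, node_label, cost_false_yes = 1, cost_false_no = 2)
labels_1_10000 <- sapply(nodes, node_label, cost_false_yes = 1, cost_false_no = 10000)

print(labels_1_2)
print(labels_1_10000)
```

With 1:2 the 85/5 node is labelled "no" and the 5/5 node "yes", so the two children differ and the split reduces the total loss (20 at the root versus 10 + 5 in the children). With 1:10000 every node is labelled "yes", so the split gains nothing.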
Terry Therneau