[R] R: How to prune using holdout data
Therneau, Terry M., Ph.D.
therneau at mayo.edu
Tue Feb 28 15:26:13 CET 2017
Let me give an outline of how to answer Alfredo's question via an example.
I will split the data set "lung" into two peices. For these subjects with
advanced lung cancer the physician's assessment of ECOG performance status
(ph.ecog) is one of the most powerful indicators of outcome. Try to
predict it from other variables.
library(survival) # for the test data set
library(rpart)
data1 <- lung[1:125,]
data2 <- lung[126:228,]
rfit1 <- rpart(ph.ecog ~ ., data=data1)
printcp(rfit1)
CP nsplit rel error xerror xstd
1 0.565788 0 1.00000 1.04037 0.100516
2 0.098516 1 0.43421 0.44906 0.045001
3 0.042708 2 0.33570 0.35134 0.041692
4 0.031032 3 0.29299 0.37610 0.042971
5 0.019949 4 0.26196 0.37753 0.044692
6 0.010000 5 0.24201 0.39166 0.050332
# Validate using data2. First get predictions for each of the pruned trees
cpvalues <- rfit1$cptable[,1]
pmat <- matrix(0, nrow(data2), length(cpvalues))
for (i in 1:length(cpvalues))
pmat[,i] <- predict(prune(rfit1, cpvalues[i]), newdata=data2)
Now, we need to decide what on a measure of error. Try simple squared error.
error <- colMeans((data2$ph.ecog - pmat)^2)
round(error, 3)
[1] 0.493 0.280 0.210 0.225 0.186 0.198
This is simple, but other cases are more complex. The performace score is
actually an integer from 0-4 (5= dead), see
http://ecog-acrin.org/resources/ecog-performance-status
table(lung$ph.ecog)
0 1 2 3
63 113 50 1
Suppose instead we fit a model and treat the response as categorical?
The total number of nested models is a bit smaller.
rfit2 <- rpart(ph.ecog ~ ., data=data1, method="class")
printcp(rfit2)
CP nsplit rel error xerror xstd
1 0.35938 0 1.00000 1.00000 0.086951
2 0.12500 1 0.64062 0.64062 0.081854
3 0.06250 2 0.51562 0.70312 0.083662
4 0.03125 4 0.39062 0.57812 0.079610
5 0.01000 5 0.35938 0.56250 0.078977
predict(rfit2, newdata=data2)[1:5,]
0 1 2 3
126 0.03125 0.9375 0.03125 0
127 0.03125 0.9375 0.03125 0
128 0.03125 0.9375 0.03125 0
129 0.03125 0.9375 0.03125 0
130 0.37500 0.6250 0.00000 0
Now, we can ask for predicted probabilities for each class (default), which is a vector
of length 4 for each subject, or for the predicted class, which is a single value. Which
do we want, and then what is the best measure of prediction error?
If three subjects with value 0 had prediction class vectors of (.8, .2, 0, 0),
(.8, .1, .1, 0) and (.45, .25, .2, .1), one outlook would say they all are the
same (all pick 0 as the best), others would give them different errors. Is
the second prediction worse than the first?
What if the single subject with ph.ecog=3 had ended up in the validation data
set; how should we judge their prediction?
This complexity is one reason that there is not a simple function for
"validation" with a new data set.
On 02/27/2017 09:48 AM, Alfredo wrote:
> Thank you, Terry, for your answer.
>
> I’ll try to explain better my question. When you create a classification or regression
> tree you first grow a tree based on a splitting criteria: this usually results in a large
> tree that provides a good fit to the training data. The problem with this tree is its
> potential for overfitting the data: the tree can be tailored too specifically to the
> training data and not generalize well to new data. The solution (apart cross-validation)
> is to find a smaller subtree that results in a low error rate on *holdout or validation data.*
>
> Hope it helps to clarity my question.
>
> Best,
>
> Alfredo
>
> -----Messaggio originale-----
> Da: Therneau, Terry M., Ph.D. [mailto:therneau at mayo.edu]
>
> You will need to give more detail of exactly what you mean by "prune using a validation
> set". THe prune.rpart function will prune at any value you want, what I suspect you are
> looking for is to compute the error of each possible tree, using a validation data set,
> then find the best one, and then prune there.
>
> How do you define "best"?
>
More information about the R-help
mailing list