[R] Random Forest AUC

Sat Oct 23 03:18:39 CEST 2010

Let me expand on what Max showed.

For the most part, performance on the training set is meaningless.
(That's the case for most algorithms, but especially so for RF.)  In the
default (and recommended) setting, the trees are grown to the maximum
size, which means that quite likely there's only one data point in most
terminal nodes, and the prediction at each terminal node is determined
by the majority class in the node, or by the lone data point.  Suppose
that is the case all the time; i.e., in all trees all terminal nodes
have only one data point.  A particular data point would be "in-bag" in
about 63% of the trees in the forest, and every one of those trees has
the correct prediction for that data point.  Even if all the trees for
which that data point is out-of-bag gave the wrong prediction, by
majority vote of all trees you still get the right answer in the end.
Thus the perfect prediction on the training set for RF is basically "by
design".
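
The argument above can be checked with a rough simulation in base R.
This is only a sketch under the idealized assumption stated above (every
in-bag tree predicts its own training point correctly), and the names n,
ntree, inbag, and so on are made up for illustration; nothing here comes
from the randomForest package itself.

```r
set.seed(1)
n     <- 100   # hypothetical training set size
ntree <- 500   # hypothetical number of trees

# Probability that a given point is drawn at least once in a bootstrap
# sample of size n, i.e. that it is "in-bag" for one tree.
p.inbag <- 1 - (1 - 1/n)^n

# For each tree, draw a bootstrap sample and record which points are
# in-bag.  The result is an n x ntree logical matrix.
inbag <- replicate(ntree, seq_len(n) %in% sample(n, n, replace = TRUE))

# Fraction of trees in which each point is in-bag.
frac.inbag <- rowMeans(inbag)

# Worst case: in-bag trees vote correctly, out-of-bag trees all vote
# wrong.  The majority vote is still right whenever frac.inbag > 0.5.
majority.correct <- frac.inbag > 0.5

print(p.inbag)
print(mean(frac.inbag))
print(all(majority.correct))
```

The in-bag fraction tends to 1 - exp(-1) as n grows, which is why the
in-bag trees alone can be enough to carry the vote.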

Generally, good training prediction is just a self-fulfilling prophecy.
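
One way to get a more honest number is to score the out-of-bag (OOB)
predictions instead of the training-set predictions.  The following is a
minimal sketch, assuming the randomForest and ROCR packages from the
original post are installed.  It relies on predict() for a randomForest
object returning the OOB predictions when newdata is omitted.  The names
dat, train.prob, oob.prob, and the helper auc() are mine, for
illustration only, and ntree = 500 is an arbitrary choice.

```r
library(randomForest)
library(ROCR)

set.seed(1)
data(iris)
dat <- iris[iris$Species != "setosa", ]
dat$Species <- factor(dat$Species)   # drop the unused "setosa" level

fit <- randomForest(Species ~ ., data = dat, ntree = 500)

# Training-set probabilities: each point is scored by all trees,
# including the ones that were grown on it.
train.prob <- predict(fit, newdata = dat, type = "prob")[, 2]

# OOB probabilities: with newdata omitted, each point is scored only by
# the trees for which it was out-of-bag.
oob.prob <- predict(fit, type = "prob")[, 2]

# Illustrative helper: AUC via ROCR, as in the original post.
auc <- function(p, y) performance(prediction(p, y), "auc")@y.values[[1]]
auc.train <- auc(train.prob, dat$Species)
auc.oob   <- auc(oob.prob, dat$Species)

print(c(train = auc.train, oob = auc.oob))
```

The OOB figure is the one to compare against other models; the
training figure mostly reflects the in-bag effect described above.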

Andy

> -----Original Message-----
> From: r-help-bounces at r-project.org 
> [mailto:r-help-bounces at r-project.org] On Behalf Of vioravis
> Sent: Friday, October 22, 2010 1:20 AM
> To: r-help at r-project.org
> Subject: [R] Random Forest AUC
> 
> 
> Guys,
> 
> I used Random Forest with a couple of data sets I had to predict a
> binary response. In all the cases, the AUC on the training set comes
> out to be 1. Is this always the case with random forests? Can someone
> please clarify this?
> 
> I have given a simple example, first using logistic regression and
> then using random forests, to illustrate the problem. The AUC of the
> random forest comes out to be 1.
> 
> data(iris)
> iris <- iris[(iris$Species != "setosa"),]
> iris$Species <- factor(iris$Species)
> fit <- glm(Species~.,iris,family=binomial)
> train.predict <- predict(fit,newdata = iris,type="response")          
> library(ROCR)
> plot(performance(prediction(train.predict,iris$Species),"tpr","fpr"),col = "red")
> auc1 <- performance(prediction(train.predict,iris$Species),"auc")@y.values[[1]]
> legend("bottomright",legend=c(paste("Logistic Regression (AUC=",formatC(auc1,digits=4,format="f"),")",sep="")),
> 		col=c("red"), lty=1)
> 
> 
> library(randomForest)
> fit <- randomForest(Species ~ ., data=iris, ntree=50)
> train.predict <- predict(fit,iris,type="prob")[,2]          
> plot(performance(prediction(train.predict,iris$Species),"tpr","fpr"),col = "red")
> auc1 <- performance(prediction(train.predict,iris$Species),"auc")@y.values[[1]]
> legend("bottomright",legend=c(paste("Random Forests (AUC=",formatC(auc1,digits=4,format="f"),")",sep="")),
> 		col=c("red"), lty=1)
> 
> Thank you.
> 
> Regards,
> Ravishankar R