[R] simplifying randomForest(s)

Liaw, Andy andy_liaw at merck.com
Tue Sep 16 14:14:45 CEST 2003


> From: Ramon Diaz-Uriarte [mailto:rdiaz at cnio.es] 
> Dear All,
> I have been using the randomForest package for a couple of difficult 
> prediction problems (which also share p >> n). The 
> performance is good, but 
> since all the variables in the data set are used, 
> interpretation of what is 
> going on is not easy, even after looking at variable 
> importance as produced 
> by the randomForest run.
> I have tried a simple "variable selection" scheme, and it 
> does seem to perform 
> well (as judged by leave-one-out) but I am not sure if it 
> makes any sense.  
> The idea is, in a kind of backwards elimination,  to 
> eliminate one by one the 
> variables with smallest importance (or all the ones with 
> negative importance 
> in one go) until the out-of-bag estimate of classification 
> error becames 
> larger than that of the previous model (or of the initial 
> model). So nothing 
> really new. But I haven't been able to find any comments in 
> the literature 
> about "simplification" of random forests. 
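The scheme described above might be sketched as follows (this is my own hypothetical illustration, not code from the post; the toy `x`, `y` data are made up just so the snippet runs):

```r
library(randomForest)

## toy p >> n data, purely illustrative
set.seed(1)
x <- matrix(rnorm(50 * 200), 50, 200,
            dimnames = list(NULL, paste0("v", 1:200)))
y <- factor(sample(c("a", "b"), 50, replace = TRUE))

rf <- randomForest(x, y, importance = TRUE)
best.err <- rf$err.rate[nrow(rf$err.rate), "OOB"]
repeat {
  imp  <- importance(rf, type = 1)            # mean decrease in accuracy
  drop <- rownames(imp)[which.min(imp)]       # least important variable
  x.new <- x[, setdiff(colnames(x), drop), drop = FALSE]
  rf.new <- randomForest(x.new, y, importance = TRUE)
  err <- rf.new$err.rate[nrow(rf.new$err.rate), "OOB"]
  ## stop once the OOB error exceeds the previous model's
  if (err > best.err || ncol(x.new) < 2) break
  x <- x.new; rf <- rf.new; best.err <- err
}
```

Note that the stopping rule compares OOB errors of successively refit forests, which is exactly where the bias discussed in the reply creeps in.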

This is quite a hazardous game.  We've been burned by this ourselves.  I'll
send you, off-list, a paper we submitted on variable selection for random
forests.  (Those who are interested, let me know.)

The basic problem is that when you select important variables by RF and then
re-run RF with those variables, the OOB error rate becomes biased downward.
As you iterate more times, the "overfitting" becomes more and more severe
(in the sense that the OOB error rate will keep decreasing while the error
rate on an independent test set stays flat or increases).  I was naïve enough
to ask Breiman about this, and his reply was something like "any competent
statistician would know that you need something like cross-validation to do
this."
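To illustrate the point about cross-validation: an honest error estimate requires redoing the selection from scratch inside every fold.  A minimal sketch (my own, with made-up toy data and a stand-in selection rule, not anything from Breiman):

```r
library(randomForest)

## toy p >> n data, purely illustrative
set.seed(1)
x <- matrix(rnorm(50 * 200), 50, 200,
            dimnames = list(NULL, paste0("v", 1:200)))
y <- factor(sample(c("a", "b"), 50, replace = TRUE))

## stand-in selection rule: keep the 20 most important variables
select.vars <- function(x, y) {
  imp <- importance(randomForest(x, y, importance = TRUE), type = 1)
  rownames(imp)[order(imp[, 1], decreasing = TRUE)[1:20]]
}

n.folds <- 5
folds <- sample(rep(seq_len(n.folds), length.out = nrow(x)))
cv.err <- sapply(seq_len(n.folds), function(k) {
  train <- folds != k
  vars <- select.vars(x[train, ], y[train])  # selection sees training data only
  rf   <- randomForest(x[train, vars, drop = FALSE], y[train])
  mean(predict(rf, x[!train, vars, drop = FALSE]) != y[!train])
})
mean(cv.err)   # estimate covering the *whole* procedure, selection included
```

The key design point is that `select.vars()` is called inside the fold loop; selecting variables on the full data first and then cross-validating only the final model would reintroduce the downward bias.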
In the upcoming version 5 of Breiman's Fortran code, he offers an option to
run RF twice, first time with all variables, and the second with the k
(selected by user) most important variables from the 1st run.  The OOB error
rate from the 2nd run is no longer unbiased, but the bias is probably not
too severe with only one iteration.
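The same two-pass scheme can be reproduced with the R package (my own sketch with toy data; `k` is the user-chosen number of variables to keep):

```r
library(randomForest)

## toy p >> n data, purely illustrative
set.seed(1)
x <- matrix(rnorm(50 * 200), 50, 200,
            dimnames = list(NULL, paste0("v", 1:200)))
y <- factor(sample(c("a", "b"), 50, replace = TRUE))

rf1 <- randomForest(x, y, importance = TRUE)       # first pass: all variables
k   <- 10
top <- names(sort(importance(rf1, type = 1)[, 1],
                  decreasing = TRUE))[1:k]
rf2 <- randomForest(x[, top], y)                   # second pass: top k only
rf2$err.rate[nrow(rf2$err.rate), "OOB"]            # mildly optimistic OOB error
```

As noted above, the second-pass OOB error is no longer unbiased, but with a single selection step the optimism is usually modest.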

> Any suggestions/comments?
> Best,
> Ramón
> -- 
> Ramón Díaz-Uriarte
> Bioinformatics Unit
> Centro Nacional de Investigaciones Oncológicas (CNIO)
> (Spanish National Cancer Center)
> Melchor Fernández Almagro, 3
> 28029 Madrid (Spain)
> Fax: +34-91-224-6972
> Phone: +34-91-224-6900

R-help at stat.math.ethz.ch mailing list
