[R] simplifying randomForest(s)

Tue Sep 16 11:44:21 CEST 2003

Dear All,

I have been using the randomForest package for a couple of difficult 
prediction problems (which also share p >> n). The performance is good, but 
since all the variables in the data set are used, interpretation of what is 
going on is not easy, even after looking at variable importance as produced 
by the randomForest run.

I have tried a simple "variable selection" scheme, and it does seem to perform 
well (as judged by leave-one-out), but I am not sure whether it makes any sense.
The idea is, in a kind of backwards elimination, to eliminate one by one the
variables with the smallest importance (or all the ones with negative importance
in one go) until the out-of-bag estimate of classification error becomes
larger than that of the previous model (or of the initial model). So nothing
really new. But I haven't been able to find any comments in the literature 
about "simplification" of random forests. 
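
To make the scheme concrete, here is a minimal sketch of the elimination loop 
described above, for a classification forest. The function names simplify_rf 
and oob_error, the min_vars floor, and the choice to compare against the 
previous accepted model (rather than the initial one) are assumptions made for 
illustration only; x is assumed to be a data frame or matrix with column names.

```r
library(randomForest)

# Hypothetical helper: OOB error rate of a classification forest after all trees.
oob_error <- function(rf) rf$err.rate[rf$ntree, "OOB"]

# Sketch of backward elimination driven by permutation importance.
# Stops as soon as the OOB error becomes larger than that of the previous model.
simplify_rf <- function(x, y, ntree = 500, min_vars = 2) {
  vars <- colnames(x)
  fit <- randomForest(x[, vars, drop = FALSE], y,
                      ntree = ntree, importance = TRUE)
  best_err <- oob_error(fit)

  while (length(vars) > min_vars) {
    # type = 1: mean decrease in accuracy (can be negative)
    imp_mat <- importance(fit, type = 1)
    imp <- setNames(imp_mat[, 1], rownames(imp_mat))

    # Drop all variables with negative importance in one go,
    # otherwise drop only the single least important one.
    to_drop <- if (any(imp < 0)) names(imp)[imp < 0] else names(imp)[which.min(imp)]
    candidate <- setdiff(vars, to_drop)
    if (length(candidate) < min_vars) break

    new_fit <- randomForest(x[, candidate, drop = FALSE], y,
                            ntree = ntree, importance = TRUE)
    new_err <- oob_error(new_fit)
    if (new_err > best_err) break  # error got worse: keep the previous model

    vars <- candidate
    fit <- new_fit
    best_err <- new_err
  }
  list(vars = vars, oob_error = best_err, fit = fit)
}

# Example call on a built-in data set (not a p >> n problem, just a smoke test).
set.seed(1)
res <- simplify_rf(iris[, 1:4], iris$Species)
res$vars
```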

Any suggestions/comments?

Best,

Ramón

-- 
Ramón Díaz-Uriarte
Bioinformatics Unit
Centro Nacional de Investigaciones Oncológicas (CNIO)
(Spanish National Cancer Center)
Melchor Fernández Almagro, 3
28029 Madrid (Spain)
Fax: +-34-91-224-6972
Phone: +-34-91-224-6900

http://bioinfo.cnio.es/~rdiaz