[R] RE: random forests for R

Warnes, Gregory R gregory_r_warnes at groton.pfizer.com
Wed Apr 3 13:55:27 CEST 2002

Hi Andy,

I'm glad to see that someone has put up an R package of Leo's code.  I made
an R package using his first release of the code, but never had/took the
time to push it through the publication review process here so that I could
distribute it.  I'm glad you have.


> -----Original Message-----
> From: Liaw, Andy [mailto:andy_liaw at merck.com]
> Sent: Tuesday, April 02, 2002 10:23 AM
> To: 'r-announce at lists.R-project.org'
> Subject: random forests for R
> Hi all,
>
> There is now a package available on CRAN that provides an R interface to
> Leo Breiman's random forest classifier.
>
> Basically, random forest does the following:
>
> 1.  Select ntree, the number of trees to grow, and mtry, a number no
>     larger than the number of variables.
> 2.  For i = 1 to ntree:
> 3.  Draw a bootstrap sample from the data.  Call those not in the
>     bootstrap sample the "out-of-bag" data.
> 4.  Grow a "random" tree, where at each node, the best split is chosen
>     among mtry randomly selected variables.  The tree is grown to maximum
>     size and not pruned back.
> 5.  Use the tree to predict out-of-bag data.
> 6.  In the end, use the predictions on out-of-bag data to form majority
>     votes.
> 7.  Prediction of test data is done by majority votes from predictions
>     from the ensemble of trees.
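
The seven steps above can be sketched in a few dozen lines.  This is only an
illustration in plain Python, not the package's R interface or the Breiman and
Cutler Fortran code.  It assumes numeric predictors, Gini impurity as the
split criterion, and class labels that are not tuples; the function names
(grow_tree, random_forest, predict_forest) and the toy data are made up for
this example.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (assumed split criterion)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def grow_tree(X, y, rows, mtry, rng):
    """Step 4: grow an unpruned tree on the row indices in `rows`."""
    labels = [y[i] for i in rows]
    if len(set(labels)) == 1:
        return labels[0]                      # pure node becomes a leaf
    best = None
    for j in rng.sample(range(len(X[0])), mtry):   # mtry random variables
        # Dropping the largest value keeps both children non-empty.
        for t in sorted({X[i][j] for i in rows})[:-1]:
            left = [i for i in rows if X[i][j] <= t]
            right = [i for i in rows if X[i][j] > t]
            score = (len(left) * gini([y[i] for i in left])
                     + len(right) * gini([y[i] for i in right]))
            if best is None or score < best[0]:
                best = (score, j, t, left, right)
    if best is None:
        # Simplification: if the chosen variables are all constant here,
        # stop and return the majority class.
        return Counter(labels).most_common(1)[0][0]
    _, j, t, left, right = best
    return (j, t, grow_tree(X, y, left, mtry, rng),
            grow_tree(X, y, right, mtry, rng))

def predict_tree(node, x):
    """Walk from the root to a leaf; internal nodes are 4-tuples."""
    while isinstance(node, tuple):
        j, t, left, right = node
        node = left if x[j] <= t else right
    return node

def random_forest(X, y, ntree=50, mtry=1, seed=0):
    """Steps 1-6: returns the list of trees and the out-of-bag error."""
    rng = random.Random(seed)
    n = len(X)
    trees = []
    oob_votes = [Counter() for _ in range(n)]
    for _ in range(ntree):                              # step 2
        boot = [rng.randrange(n) for _ in range(n)]     # step 3
        in_bag = set(boot)
        tree = grow_tree(X, y, boot, mtry, rng)         # step 4
        trees.append(tree)
        for i in range(n):                              # step 5
            if i not in in_bag:
                oob_votes[i][predict_tree(tree, X[i])] += 1
    # Step 6: majority vote over the trees for which a case was out-of-bag.
    voted = [(i, v.most_common(1)[0][0]) for i, v in enumerate(oob_votes) if v]
    wrong = sum(pred != y[i] for i, pred in voted)
    oob_error = wrong / len(voted) if voted else float("nan")
    return trees, oob_error

def predict_forest(trees, x):
    """Step 7: majority vote of all trees for a new case."""
    return Counter(predict_tree(t, x) for t in trees).most_common(1)[0][0]

# Toy data (made up): two Gaussian clusters centred at (0, 0) and (4, 4).
data_rng = random.Random(1)
X = [[data_rng.gauss(c, 1.0), data_rng.gauss(c, 1.0)]
     for c in (0, 4) for _ in range(30)]
y = [label for label in (0, 1) for _ in range(30)]

trees, oob_error = random_forest(X, y, ntree=50, mtry=1, seed=0)
print("OOB error estimate:", oob_error)
print("Predictions:", predict_forest(trees, [0.0, 0.0]),
      predict_forest(trees, [4.0, 4.0]))
```

The out-of-bag error here is computed from the same loop that grows the
trees, so no separate test set or cross-validation run is needed.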
> In the tech report
> http://oz.berkeley.edu/users/breiman/randomforest2001.pdf, Breiman showed
> that this technique is very competitive with boosting classification
> trees.  In our own experience, it is competitive with nonlinear
> classifiers such as artificial neural nets and support vector machines.
> Two of the significant advantages of random forests over other methods
> (IMHO) are: a) there is only one parameter (mtry) to adjust, and the
> result is usually not sensitive to it; and b) the built-in
> cross-validation via the use of out-of-bag data gives a quite accurate
> estimate of test set error, and offers quite effective protection against
> overfitting.
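
On point b): a bootstrap sample of size n leaves out each case with
probability (1 - 1/n)^n, which is about 37% for large n, so every tree has a
sizeable held-out set to be tested on.  A quick numerical check in Python
(illustrative only; oob_fraction is a name made up for this example):

```python
import math
import random

def oob_fraction(n, rng):
    """Fraction of n cases that never appear in one bootstrap sample."""
    in_bag = {rng.randrange(n) for _ in range(n)}
    return 1 - len(in_bag) / n

rng = random.Random(0)
n = 1000
reps = 200

# Average the out-of-bag fraction over many bootstrap samples.
avg = sum(oob_fraction(n, rng) for _ in range(reps)) / reps
expected = (1 - 1 / n) ** n

print("simulated:", avg)
print("(1 - 1/n)^n:", expected)
print("exp(-1):", math.exp(-1))
```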
> The code is based on version 3.1 of the original Fortran code written by
> Breiman and Cutler (http://www.stat.berkeley.edu/users/breiman/).  The
> User Guide for the Fortran code on Breiman's web site explains some of
> the facilities provided in the code (such as assessing variable
> importance, and proximity measures).  Some facilities provided in the
> original Fortran code have been taken out:  transforming data to
> principal components, and multidimensional scaling of the "proximity"
> matrix.  These can easily be done in R before and after calls to the
> random forest functions.  Random numbers are generated by R's RNG, rather
> than the one supplied in the original Fortran code.

> I'd like to thank Profs. B. D. Ripley, J. Lindsey, and others on R-help who
> answered many of my questions when I was working on this package.  The
> formula interface and part of the code in the predict method are outright
> "stolen" from svm() in the e1071 package and nnet() in the VR bundle.
>
> Questions/comments/bugs/patches welcome!

> Andy I. Liaw, PhD
> Biometrics Research          Phone: (732) 594-0820
> Merck & Co., Inc.              Fax: (732) 594-1565
> P.O. Box 2000, RY70-38            Rahway, NJ 07065
> mailto:andy_liaw at merck.com

