[R] Question on class 1, 2 output for RandomForest
Liaw, Andy
andy_liaw at merck.com
Wed Mar 23 16:31:22 CET 2005
The `1' and `2' columns are the error rates within those classes. E.g., the
last row of the `1' column should correspond to the class.error for "-", and
the last row of the `2' column to the class.error for "+". (I would
have thought that that should be fairly obvious, but I guess not. It mimics
what Breiman and Cutler's Fortran code does.) I suspect you showed us the
output from two different runs, so the numbers don't match. They do for me:
> library(randomForest)
randomForest 4.5-4
Type rfNews() to see new features/changes/bug fixes.
> credit <- read.csv(url("ftp://ftp.ics.uci.edu/pub/machine-learning-databases/credit-screening/crx.data"), header=FALSE, na.string="?")
> credit.rf <- randomForest(V16~., credit, imp=T, do.trace=100,
na.action=na.omit)
ntree      OOB      1      2
  100:  20.37% 14.01% 28.04%
  200:  21.59% 15.41% 29.05%
  300:  20.52% 13.45% 29.05%
  400:  20.52% 13.17% 29.39%
  500:  20.21% 12.61% 29.39%
> credit.rf
Call:
randomForest(x = V16 ~ ., data = credit, imp = T, do.trace = 100,
na.action = na.omit)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 3
OOB estimate of error rate: 20.21%
Confusion matrix:
    -   + class.error
- 312  45   0.1260504
+  87 209   0.2939189
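
To make the correspondence concrete, here is a minimal sketch in base R that recomputes the error rates from the confusion matrix printed above. The matrix is typed in by hand from that output, and the variable names (conf, class_error, oob_error) are only illustrative; nothing here calls randomForest itself.

```r
# Confusion matrix copied by hand from the printed output above
# (rows = actual class, columns = predicted class).
conf <- matrix(c(312,  45,
                  87, 209),
               nrow = 2, byrow = TRUE,
               dimnames = list(actual = c("-", "+"),
                               predicted = c("-", "+")))

# Per-class error: misclassified cases in a row divided by the row total.
class_error <- 1 - diag(conf) / rowSums(conf)

# Overall OOB error: all off-diagonal cases divided by the grand total.
oob_error <- 1 - sum(diag(conf)) / sum(conf)

# As percentages, these are 20.21, 12.61 and 29.39.
print(round(100 * c(OOB = oob_error, class_error), 2))
```

The three percentages agree with the 500-tree row of the do.trace output: the `1' column is the error for "-" and the `2' column is the error for "+".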
The article in R News was written for the first version of the package. It
has changed quite a bit in many respects since then. The `class error' may
be important, e.g., if one of the classes makes up only a small proportion of
the data.
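
As a rough illustration of that last point, the sketch below uses a made-up confusion matrix (hypothetical numbers, not from the credit data) in which one class is only 5% of the cases. The names conf_rare, overall_error and rare_class_error are invented for this example.

```r
# Hypothetical confusion matrix, for illustration only:
# 950 cases of class "a" and 50 cases of the rare class "b".
conf_rare <- matrix(c(940, 10,
                       40, 10),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(actual = c("a", "b"),
                                    predicted = c("a", "b")))

# Overall error looks small because class "a" dominates: 50 / 1000.
overall_error <- 1 - sum(diag(conf_rare)) / sum(conf_rare)

# Per-class error shows that most of the rare class is misclassified: 40 / 50.
rare_class_error <- 1 - diag(conf_rare) / rowSums(conf_rare)

print(overall_error)
print(rare_class_error)
```

With these made-up counts the overall error is 5%, while the error within class "b" is 80%, which the overall figure alone would hide.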
Andy
> From: Melanie Vida
>
> Hi All,
>
> I read the R newsletter, Volume 2/3, December 2002, page 18, and tried
> the example there. Then I used randomForest on a different data set
> from the UCI repository. The results for the "credit" data contained 2
> additional columns, "1" and "2", that the newsletter example with the
> fgl data set did not produce.
>
> For the "credit" data, what do the columns headed "1" and "2"
> imply for ntree=100...500 (below)? Does "1" mean the actual data,
> "class 1", and "2" a group of synthetic data, "class 2"? Did my random
> forest automatically default to unsupervised learning, create
> the synthetic class 2, and then classify the combined data with
> randomForest? If so, which method did R use to generate the
> synthetic data? The newsletter states that there are 2 ways to
> generate synthetic data.
>
> Further, would the parameters to tune for this randomForest ideally
> optimize the OOB error rate and whatever the column 1 and 2 error rates
> mean? I tried mtry=2, 3 and 10, but that didn't change the errors much.
> Are these results reasonable, or should I try to tune different
> parameters for this special case?
>
> ntree      OOB      1      2
>   100:  20.72% 14.10% 28.99%
>   200:  18.99% 13.58% 25.73%
>   300:  19.71% 15.14% 25.41%
>   400:  20.00% 14.10% 27.36%
>   500:  19.13% 13.58% 26.06%
>
> Call:
> randomForest(x = V16 ~ ., data = credit, mtry = 3, importance =
> TRUE, do.trace = 100)
> Type of random forest: classification
> Number of trees: 500
> No. of variables tried at each split: 3
>
> OOB estimate of error rate: 19.86%
> Confusion matrix:
>     -   + class.error
> - 326  57   0.1488251
> +  80 227   0.2605863
>
>
> Thanks in advance,
>
> -Melanie
> -------
> # Read in the credit table
> credit =
> read.table(url('ftp://ftp.ics.uci.edu/pub/machine-learning-databases/credit-screening/crx.data'),sep=",")
> str(credit)
> credit$V2 = as.numeric(credit$V2)
> credit$V14 = as.numeric(credit$V14)
> str(credit)
>
> credit.rf <- randomForest(V16 ~ ., data=credit, mtry=3, importance =
> TRUE, do.trace=100)
> print(credit.rf)