[R] Random Forest - Strata
Max Kuhn
mxkuhn at gmail.com
Tue Jul 27 22:31:30 CEST 2010
The index indicates which samples should go into the training set for
each resample. However, you are using out-of-bag estimation (method =
"oob"), so train would use the whole training set and return the OOB
error (instead of the error estimates that would be produced by
resampling via the index).
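
As a rough illustration of the point above (not code from the original
thread), here is a minimal sketch that calls randomForest directly. It
assumes the randomForest package is installed. Each tree draws its own
bootstrap sample from whatever data the forest is given, so the inbag
matrix describes the forest's internal sampling and no external index
is involved.

```r
# Sketch: the forest's own per-tree bootstrap, independent of any resampling index.
# Assumes the randomForest package is installed.
library(randomForest)

set.seed(42)
rf <- randomForest(Species ~ ., data = iris, ntree = 10, keep.inbag = TRUE)

# One row per sample, one column per tree; a non-zero entry means the
# row was drawn into that tree's bootstrap sample.
inbag_dim <- dim(rf$inbag)

# OOB error after the last tree, computed from rows the trees did not see.
oob_error <- rf$err.rate[rf$ntree, "OOB"]

print(inbag_dim)
print(oob_error)
```
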
Which do you want? OOB estimates or other estimates? Based on your
previous email, I figured you would have an index list with three sets
of sample indices for sites A+B, sites A+C and sites B+C. In this way
you would do three resamples: the first fits using data from sites A
and B, then predicts on C (and so on). The resampled error estimates
would then be based on the average of the three hold-out sets
(actually hold-out sites). OOB error doesn't sound like what you want.
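
The three-resample setup described above could be sketched roughly as
follows. This is only an illustration under stated assumptions: the
"site" column is made up (iris has no sites), the names dat, siteIndex
and ctrl are hypothetical, and method = "cv" is used simply as a
resampling method other than "oob" so that the index is honoured.

```r
# Hypothetical data: iris plus an invented 'site' column (A, B, C).
set.seed(1)
dat <- iris
dat$site <- sample(rep(c("A", "B", "C"), length.out = nrow(dat)))

# One resample per held-out site: each element lists the rows used for fitting.
sites <- sort(unique(dat$site))
siteIndex <- lapply(sites, function(s) which(dat$site != s))
names(siteIndex) <- paste0("holdout_", sites)

# The caret part only runs if caret and randomForest are installed.
if (requireNamespace("caret", quietly = TRUE) &&
    requireNamespace("randomForest", quietly = TRUE)) {
  library(caret)
  # Rows not listed in an index element are treated as that resample's hold-out set.
  ctrl <- trainControl(method = "cv", index = siteIndex)
  fit <- train(x = dat[, 1:4], y = dat$Species, method = "rf",
               ntree = 50, trControl = ctrl)
  print(fit$results)
}
```

With this layout the reported accuracy is averaged over the three
held-out sites rather than taken from the OOB samples.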
Max
On Tue, Jul 27, 2010 at 2:46 PM, Coll <gbcoll2 at gmail.com> wrote:
>
> Thanks for all the help.
>
> I had tried using the "index" argument in caret to dictate which rows of the
> sample would be used to build each tree in the RF (e.g. use all data
> from sites A and B for training, hold out all data from site C for testing, etc.)
>
> However after running, when I cross-checked the "index" that goes to train
> function and the "inbag" in the resulting randomForest object, I found the
> two didn't match.
>
> Shown as below:
>
>> data(iris)
>> tmpIrisIndex <- createDataPartition(iris$Species, p=0.632, times = 10)
>> head(tmpIrisIndex,3)
> [[1]]
> [1] 1 2 3 7 10 11 12 13 16 18 20 22 24 25 26 27 28 29 31
> [20] 34 35 36 37 38 39 40 41 43 46 47 48 50 52 53 55 56 57 58
> [39] 61 64 65 66 67 68 69 71 74 75 76 77 79 82 83 84 85 86 88
> [58] 90 91 92 94 96 98 99 102 103 104 106 108 109 111 112 113 114 115 116
> [77] 117 119 120 121 123 126 128 129 130 131 132 134 136 139 140 141 143 146 147
> [96] 150
>
> [[2]]
> [1] 1 3 6 7 8 10 12 13 14 16 18 20 21 22 23 24 26 27 28
> [20] 29 30 32 34 35 36 38 42 44 46 47 48 50 51 53 54 55 58 60
> [39] 61 62 67 68 69 70 72 73 74 76 77 79 81 82 83 85 86 88 89
> [58] 90 92 93 95 97 99 100 103 104 105 107 108 109 111 112 113 114 117 119
> [77] 120 121 122 123 124 125 127 130 132 133 134 135 137 139 140 141 142 145 147
> [96] 149
>
> [[3]]
> [1] 1 5 7 9 10 11 12 14 18 20 21 22 23 24 26 29 30 31 33
> [20] 34 35 36 37 38 39 40 44 45 46 47 48 49 51 52 53 54 56 58
> [39] 61 63 65 66 69 70 72 74 75 76 77 78 79 80 82 83 85 86 87
> [58] 90 91 92 93 94 98 100 102 103 105 106 107 109 110 113 114 115 116 117
> [77] 121 122 123 124 125 128 129 130 131 132 133 134 135 138 139 140 141 142 146
> [96] 150
>
>> irisTrControl <- trainControl(method = "oob", index = tmpIrisIndex)
>> rf.iris.obj <-train(Species~., data= iris, method = "rf", ntree = 10,
>> keep.inbag = TRUE, trControl = irisTrControl)
> Fitting: mtry=2
> Fitting: mtry=3
> Fitting: mtry=4
>> head(rf.iris.obj$finalModel$inbag,20)
> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
> [1,] 1 0 1 0 0 0 1 0 1 1
> [2,] 1 1 1 1 1 0 1 0 1 0
> [3,] 1 1 1 0 0 1 1 0 0 0
> [4,] 1 0 1 0 1 1 0 1 0 1
> [5,] 0 1 1 1 1 1 0 1 0 1
> [6,] 1 1 0 1 0 0 1 1 1 0
> [7,] 1 1 0 0 1 1 0 0 0 0
> [8,] 1 1 1 1 1 0 1 1 1 1
> [9,] 1 1 0 1 0 1 0 1 1 0
> [10,] 1 1 1 0 1 1 0 0 0 1
> [11,] 1 1 1 1 1 1 1 0 1 0
> [12,] 1 1 1 1 1 0 1 0 1 1
> [13,] 1 0 1 1 1 1 1 1 0 1
> [14,] 0 1 1 1 0 1 0 0 0 0
> [15,] 1 1 1 1 1 1 1 1 1 0
> [16,] 1 1 0 0 0 0 1 0 1 1
> [17,] 1 0 1 0 0 0 1 1 0 1
> [18,] 1 0 1 1 1 1 1 1 1 1
> [19,] 1 0 1 0 1 1 1 0 1 1
> [20,] 1 0 1 0 1 1 1 0 1 0
>
> My understanding is that the 1st tree in the RF should be built with
> tmpIrisIndex[[1]], i.e. "1 2 3 7 10 11 12 13 ..."?
> But the inbag matrix in the resulting forest shows it is using "1 2 3 4 6 7 8
> 9..." for the 1st tree?
>
> Why does the index passed to train not match the inbag in the rf
> object? Or did I look in the wrong place to check this?
>
> Any help / comments would be appreciated. Thanks a lot.
>
> Regards,
> Coll
--
Max