[R] Coefficients of Logistic Regression from bootstrap - how to get them?
Tim Hesterberg
TimHesterberg at gmail.com
Mon Jul 28 04:13:28 CEST 2008
I'll address the question of whether you can use the bootstrap to
improve estimates, and whether you can use the bootstrap to "virtually
increase the size of the sample".
Short answer - no, with some exceptions (bumping / Random Forests).
Longer answer:
Suppose you have data (x1, ..., xn) and a statistic ThetaHat,
and that you take a number of bootstrap samples (all of size n) and
let ThetaHatBar be the average of the bootstrap statistics computed
from those samples.
Is ThetaHatBar better than ThetaHat? Usually not. Usually it
is worse. You have not collected any new data; you are just using the
existing data in a different way, one that is usually harmful:
* If the statistic is the sample mean, all this does is add
some noise to the estimate.
* If the statistic is nonlinear, this gives an estimate that
has roughly double the bias, without improving the variance.
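
As a small illustration of the two points above (not part of the
original post), the sketch below uses Python with only the standard
library and enumerates every bootstrap resample of a toy data set, so
the bootstrap average can be computed exactly. The data values and the
helper names (exact_bootstrap_mean, monte_carlo_bootstrap_mean) are
assumptions made up for this example. For the sample mean the exact
bootstrap average equals ThetaHat, so a finite number of resamples can
only add noise. For the plug-in variance (dividing by n), which is
already too small by a factor (n-1)/n on average, the exact bootstrap
average is (n-1)/n times ThetaHat, so the bias is roughly doubled.

```python
import itertools
import random


def mean(xs):
    """Sample mean (a linear statistic)."""
    return sum(xs) / len(xs)


def plugin_variance(xs):
    """Plug-in variance, dividing by n (a biased, nonlinear statistic)."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


def exact_bootstrap_mean(xs, statistic):
    """Average `statistic` over all n**n equally likely resamples of size n."""
    values = [statistic(s) for s in itertools.product(xs, repeat=len(xs))]
    return sum(values) / len(values)


def monte_carlo_bootstrap_mean(xs, statistic, n_boot, rng):
    """Average `statistic` over n_boot random resamples (ThetaHatBar)."""
    n = len(xs)
    return sum(statistic(rng.choices(xs, k=n)) for _ in range(n_boot)) / n_boot


# Toy data, chosen arbitrarily for the illustration.
data = [2.0, 3.0, 5.0, 7.0, 11.0]
n = len(data)

theta_hat_mean = mean(data)
theta_hat_var = plugin_variance(data)

# Exact bootstrap averages (no Monte Carlo error).
exact_mean = exact_bootstrap_mean(data, mean)
exact_var = exact_bootstrap_mean(data, plugin_variance)

# A finite number of resamples: the exact value plus simulation noise.
rng = random.Random(0)
mc_mean = monte_carlo_bootstrap_mean(data, mean, 200, rng)

print("sample mean:            ", theta_hat_mean)
print("exact bootstrap average:", exact_mean)
print("200-resample average:   ", mc_mean)
print("plug-in variance:       ", theta_hat_var)
print("exact bootstrap average:", exact_var)
print("(n-1)/n * plug-in var:  ", (n - 1) / n * theta_hat_var)
```

Full enumeration is only feasible for very small n (here 5**5 = 3125
resamples); it is used here just to separate the systematic effect of
averaging from Monte Carlo noise.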
What are the exceptions? The prime example is tree models (random
forests) - taking bootstrap averages helps smooth out the
discontinuities in tree models. For a simple example, suppose that a
simple linear regression model really holds:
y = beta x + epsilon
but that you fit a tree model; the tree model predictions are
a step function. If you bootstrap the data, the boundaries of
the step function will differ from one sample to another, so
the average of the bootstrap samples smears out the steps, getting
closer to the smooth linear relationship.
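
The smearing effect can be sketched in a few lines of Python (standard
library only). This is a simplified stand-in, not a random forest: it
averages one-split regression trees ("stumps") fitted to bootstrap
samples. The slope, noise level, sample size, number of resamples, and
the helper names fit_stump and predict are all assumptions chosen for
the illustration.

```python
import random


def fit_stump(xs, ys):
    """Fit a one-split regression tree by least squares.

    Returns (threshold, left_mean, right_mean).
    """
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical x values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        lm = sum(left) / len(left)
        rm = sum(right) / len(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (sse, threshold, lm, rm)
    if best is None:
        # Degenerate resample with a single distinct x: predict a constant.
        m = sum(ys) / len(ys)
        return (float("inf"), m, m)
    return best[1:]


def predict(stump, x):
    """Step-function prediction of a fitted stump."""
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


rng = random.Random(1)
n = 40
beta = 2.0  # assumed true slope in y = beta x + epsilon
xs = [i / (n - 1) for i in range(n)]
ys = [beta * x + rng.gauss(0.0, 0.3) for x in xs]

# One tree fitted to the original data: a single step.
single = fit_stump(xs, ys)

# Trees fitted to bootstrap samples: the step location moves around.
stumps = []
for _ in range(200):
    idx = [rng.randrange(n) for _ in range(n)]
    stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))

grid = [i / 100 for i in range(101)]
single_pred = [predict(single, x) for x in grid]
bagged_pred = [sum(predict(s, x) for s in stumps) / len(stumps) for x in grid]

print("distinct levels, single tree:", len(set(single_pred)))
print("distinct levels, averaged:   ", len({round(p, 9) for p in bagged_pred}))
```

The single tree predicts only two levels, while the average over
bootstrap trees takes many intermediate values near the split, which is
the smoothing described above. How much this helps in practice depends
on the data and on the depth of the trees.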
Aside from such exceptions, the bootstrap is used for inference
(bias, standard error, confidence intervals), not improving on
ThetaHat.
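
For contrast, here is a minimal sketch of the inferential use, again in
Python with only the standard library. The simulated data, the number
of resamples, the simple percentile interval, and the helper name
bootstrap_replicates are assumptions for illustration. Note that
ThetaHat itself remains the reported estimate; the bootstrap replicates
are used only to describe its sampling variability.

```python
import random


def mean(xs):
    """Sample mean, used here as the statistic of interest."""
    return sum(xs) / len(xs)


def bootstrap_replicates(xs, statistic, n_boot, rng):
    """Compute `statistic` on n_boot resamples of size n drawn with replacement."""
    n = len(xs)
    return [statistic(rng.choices(xs, k=n)) for _ in range(n_boot)]


rng = random.Random(2008)
data = [rng.gauss(10.0, 2.0) for _ in range(30)]  # simulated data

theta_hat = mean(data)  # the estimate that is reported
reps = sorted(bootstrap_replicates(data, mean, 2000, rng))
theta_hat_bar = mean(reps)

# Bootstrap estimate of bias: average replicate minus original estimate.
bias_estimate = theta_hat_bar - theta_hat

# Bootstrap standard error: standard deviation of the replicates.
std_error = (sum((r - theta_hat_bar) ** 2 for r in reps) / (len(reps) - 1)) ** 0.5

# Simple 95% percentile interval from the sorted replicates.
lower = reps[int(0.025 * len(reps))]
upper = reps[int(0.975 * len(reps)) - 1]

print("estimate:      ", theta_hat)
print("bias estimate: ", bias_estimate)
print("standard error:", std_error)
print("95% interval:  ", (lower, upper))
```

The same pattern applies to other statistics, such as regression
coefficients, by resampling whole observations and refitting the model
on each resample.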
Tim Hesterberg
>Hi Doran,
>
>Maybe I am wrong, but I think bootstrap is a general resampling method which
>can be used for different purposes...Usually it works well when you do not
>have a representative sample set (maybe with a limited number of samples).
>Therefore, I am positive with Michal...
>
>P.S., overfitting, in my opinion, is used to depict when you got a model
>which is quite specific for the training dataset but cannot be generalized
>with new samples......
>
>Thanks,
>
>--Jerry
>2008/7/21 Doran, Harold <HDoran at air.org>:
>
>> > I used bootstrap to virtually increase the size of my
>> > dataset, it should result in estimates more close to that
>> > from the population - isn't it the purpose of bootstrap?
>>
>> No, not really. The bootstrap is a resampling method for variance
>> estimation. It is often used when there is not an easy way, or a closed
>> form expression, for estimating the sampling variance of a statistic.