[R] Coefficients of Logistic Regression from bootstrap - how to get them?

Wed Jul 23 18:22:27 CEST 2008

Thank you all for your words of wisdom.

I am starting to understand what you mean by bootstrap. Not surprisingly, it 
seems to be something different from what I do. The bootstrap is a tool, and I 
would rather compare it to a hammer than to a gun. People say that a 
hammer is for driving nails. This situation is as if I planned to use it 
to break rocks.

The key point is that I don't really care about the bias or variance of 
the mean in the model. These things are useful for statisticians; 
regular people (like me, also a chemist) do not understand them and have 
no use for them (well, now I somewhat understand). My goal is very 
practical: I need an equation that can predict a patient's outcome, based 
on some data, with maximum reliability and accuracy.

I have found from the mentioned paper (and from my own experience) that 
re-sampling and running the regression on the re-sampled dataset multiple 
times does improve predictions. You have a proof of that in that paper, 
page 1502, and to me it is rather a stunning proof: compare 56% to 82% 
of correctly predicted values (correct means within 15% of the original value).
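
To make the "within 15%" criterion concrete, here is a minimal sketch in R. The numbers are made up for illustration and are not from the paper, and I am assuming that "original value" means the observed value that each prediction is compared against.

```r
# Hypothetical observed and predicted values, made up for illustration
observed  <- c(10, 20, 30, 40, 50)
predicted <- c(11, 26, 29, 33, 52)

# Relative error of each prediction with respect to the observed value
rel.err <- abs(predicted - observed) / observed

# A prediction counts as correct if it lies within 15% of the observed value
within15 <- rel.err <= 0.15

# Percentage of correctly predicted values
pct.correct <- mean(within15) * 100
print(pct.correct)
```

The 56% and 82% quoted above would be the value of pct.correct for the two approaches compared in the paper.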

I can understand that it's somewhat new for many of you, and some tried 
to discourage me from this approach (shooting myself in the foot). This 
concept was devised by, I believe, Mr Michael Hale, a respectable 
biostatistician. It utilises the bootstrap concept of resampling, though, 
after the recent discussion, I think it should be called by another name.

In addition to better predictive performance, using this concept I also 
get a second dataset with each iteration, that can be used for 
validation of the model. In this approach the validation data are 
accumulated throughout the bootstrap, and then used in the end to 
calculate log residuals using the equation with median coefficients. I am 
sure you can question that in many ways, but to me this is as good as 
you can get.
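
The procedure described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the paper: the data are simulated, the names dat, x1, x2 and y are hypothetical, and, because the outcome here is binary, the validation step uses classification accuracy in place of the log residuals mentioned above.

```r
# Hypothetical simulated data; variable names and true coefficients are
# assumptions for illustration only, not the dataset from the paper.
set.seed(1)
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- rbinom(n, 1, plogis(-0.5 + 1.0 * dat$x1 - 0.7 * dat$x2))

B <- 200                                   # number of resamples
coefs <- matrix(NA_real_, nrow = B, ncol = 3,
                dimnames = list(NULL, c("(Intercept)", "x1", "x2")))
oob <- vector("list", B)                   # rows left out of each resample

for (b in seq_len(B)) {
  idx <- sample(n, n, replace = TRUE)      # resample rows with replacement
  fit <- glm(y ~ x1 + x2, family = binomial, data = dat[idx, ])
  coefs[b, ] <- coef(fit)                  # store this iteration's coefficients
  oob[[b]] <- dat[-unique(idx), ]          # rows not drawn: validation data
}

med.coef  <- apply(coefs, 2, median)       # the "median coefficients"
full.coef <- coef(glm(y ~ x1 + x2, family = binomial, data = dat))

# Apply the median-coefficient equation to the accumulated left-out rows
val   <- do.call(rbind, oob)
eta   <- med.coef[1] + med.coef[2] * val$x1 + med.coef[3] * val$x2
p.hat <- plogis(eta)
accuracy <- mean((p.hat > 0.5) == (val$y == 1))

print(rbind(median = med.coef, full.fit = full.coef))
print(accuracy)
```

The last two lines print the median coefficients next to those from a single fit on the full data, so the two can be compared directly, followed by the accuracy on the accumulated validation rows.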

To be more practical, I will ask the authors of this paper if I can post 
their original dataset in this forum (I have it somewhere) - if you guys 
think it's interesting enough. Then any of you could use it, follow 
the procedure, and criticize it, if you wish.

--
Michal J. Figurski
HUP, Pathology & Laboratory Medicine
Xenobiotics Toxicokinetics Research Laboratory
3400 Spruce St. 7 Maloney
Philadelphia, PA 19104
tel. (215) 662-3413

S Ellison wrote:
> jeez, but you've kicked up a storm!
> 
> penn'orth on the bootstrap; and since I'm a chemist, you can ignore at
> will.
> 
> The bootstrap starts with your data and the model you developed with
> it. Resampling gives a fair idea of what the variance _around your
> current estimate_ is. But it cannot tell you how biased you are or
> improve your estimate, because there is no more information in your
> data. 
> 
> Toy example. Let's say I get some results from some measurement
> procedure, like this.
> 
> set.seed(408) #so we get the same random sample (!)
> 
> y<-rnorm(12,5) #OK, not a very convincing measurement, but....
> 
> #Now let's add a bit of bias
> y<-y+3
> 
> mean(y) #... is my (biased) estimate of the mean value.
> 
> #Now let's pretend I don't know the true answer OR the bias, which is
> #what happens in the real world, and try bootstrapping. Let's get a
> #rather generous 10000 resamples from my data;
> 
> m<-matrix(sample(y, length(y)*10000, replace=TRUE), ncol=length(y))
> #This gives me a matrix with 10000 rows, each of which is a resample 
> #of my 12 data points. 
> 
> #And now we can calculate 10000 bootstrapped means in one shot:
> bs.mean<-apply(m,1,mean) #which applies 'mean' to each row.
> 
> #We hope the variance of these things is about 1/12, 'cos we got y from
> #a normal distribution with var 1 and we had 12 of them. Let's see...
> var(bs.mean)
> 
> #which should resemble
> 1/12
> 
> #and does.. roughly. 
> #And for interest, compare with what we get direct from the data;
> var(y)/12
> #which in this case was slightly further from the 'true' variance. It
> #won't always be, though; that depends on the data.
> 
> #Anyway, the bootstrap variance looks about right. So ... on to bias
> 
> #Now, where would we expect the bootstrapped mean value to be? 
> #At the true value, or where we started?
> mean(bs.mean)
> 
> #Oh dear. It's still biased. And it looks very much like the mean of y.
> #It's clearly told us nothing about the true mean.
> 
> #Bottom line; All you have is your data. Bootstrapping uses your data.
> #Therefore, bootstrapping can tell you no more than you can get from
> #your data.
> #But it's still useful if you have some rather more complicated
> #statistic derived from a non-linear fit, because it lets you get some
> #idea of the variance.
> #But not the bias.
> 
> This may be why some folk felt that your solution as worded (an
> ever-present peril, wording) was not an answer to the right question.
> The fitting procedure already gives you the 'best estimate' (where
> 'best' means max likelihood, this time), and bootstrapping really cannot
> improve on that. It can only start at your current 'best' and move away
> from it in a random direction.  That can't possibly improve the
> estimated coefficients. And the more you bootstrap, the closer the mean
> gets to where you started. 
> So "how does the bootstrap improve on that?" was a very pertinent
> question - to which the correct answer was "it can't - but it can
> suggest what the variance might be". 
> 
> As to whether you wanted advice on whether to bootstrap or not; well,
> it's an open forum and aid is voluntary. R help always generates at
> least three replies, one of which is "tell me more about the problem",
> one of which is "why are you doing it that way?" and one of which is
> "that is probably not the problem you should be trying to solve". On a
> good day you also get the one that goes "this might solve it".
> 
> Incidentally, the boot package and the simpleboot package both do
> bootstrapping; they might solve your problem. 
> 
> Then there's advice. Folk obviously can't impose unless you let them -
> but they do know a lot about statistics and if they say something is
> silly, it is at least worth finding out why so that you (and I, for that
> matter) can better defend our silliness. 
> Also, of course, if you see someone trying to do something silly - eg
> pull the trigger while the gun is pointed at their foot - would you
> really give them the instruction they asked for on how to get the safety
> catch off? Or tell them that what they are doing is silly? 
> (Me, well, it's their foot but if I help them, they may sue me later)
> 
> 
> If any of the above helps without sounding horribly patronising, I win.
> If not, well, you have another email to burn!
> 
> happy booting
> 
> Steve Ellison
> 
>>>> Michal Figurski <figurski at mail.med.upenn.edu> 22/07/2008 20:42 >>>
> 
> 