[R] Coefficients of Logistic Regression from bootstrap - how to get them?

Michal Figurski figurski at mail.med.upenn.edu
Thu Jul 31 19:36:12 CEST 2008


Summarizing things I don't understand:
  - Honestly, I was thinking I could use bootstrap to obtain a better
estimate of a mean - provided that I want it. So, I can't?
  - If I can't obtain reliable estimates of CI and variance from a small
dataset, but I can do it with bootstrap - isn't it a "virtual increase"
of the size of dataset? OK, these are just words, I won't fight for that.
  - I don't understand why a procedure works for 26 models and doesn't
work for one... Intuitively this doesn't make sense...
  - I don't understand why resampling *cannot* improve... while it does?
I know the proof is going to be hard to follow, but let me try! (The
proof of the opposite is in the paper).
  - I truly don't understand what I don't understand about what I am
doing. This is getting too convoluted for me...
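
To make the first two points concrete, here is a minimal sketch in base R. It assumes a hypothetical sample `x` of 20 simulated values (not data from this thread). It illustrates what the bootstrap adds for a sample mean: a standard error and an interval for the estimator, while the average of the bootstrap means stays essentially at the original sample mean.

```r
set.seed(1)

# Hypothetical small sample (simulated; not data from this thread)
x <- rnorm(20, mean = 10, sd = 2)
n <- length(x)

# Resample with replacement B times and recompute the mean each time
B <- 2000
boot_means <- replicate(B, mean(sample(x, replace = TRUE)))

mean(x)                                  # point estimate from the data
mean(boot_means)                         # nearly the same value: no "better" mean
sd(boot_means)                           # bootstrap standard error of the mean
quantile(boot_means, c(0.025, 0.975))    # percentile confidence interval
```

The resamples describe the variability of the estimate computed from these 20 values; they do not behave like additional observations.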

And a remark on a point where I disagree with Gustaf:

The text below, quoted from Pawinski et al ("Twenty six..."), is missing
an important piece of information - that they repeated that step 50
times - each time with a "randomly selected subset". Excuse my ignorance
again, but this looks like bootstrap (re-sampling), doesn't it? Although
I won't argue about names.

I want to assure everyone here that I did *exactly* what they did. I
work in the same lab that this paper came from, and I just had their
procedure in SPSS translated to SAS. Moreover, the translation was done
with the help of a _trustworthy biostatistician_ - I was not good enough
with SAS at the time to do it myself. The biostatistician wrote the
randomization and regression subroutines. I later improved them using
macros (less code) and added a validation part. It was then approved by
that biostatistician.
OK, I did not exactly do the same, because I repeated the step 100 times
for 34 *pre-defined* models and on a different dataset. But that's about
all the difference.

I hope this solves everyone's dilemma whether I did what is described in
Pawinski's paper or not.

This discussion, though, started with my question of how to do it in R,
instead of SAS, and with logistic (not linear) regression. Thank you,
Gustaf, for the code - this was the help I needed.
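
For readers arriving from the subject line, a minimal sketch of bootstrapping logistic regression coefficients in base R follows. This is not Gustaf's code, which is not reproduced in this message. The data frame `dat`, its columns `x1`, `x2`, `y`, and the simulated coefficients are all assumptions made for illustration.

```r
set.seed(42)

# Hypothetical data: binary outcome y, two predictors (all simulated)
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- rbinom(n, 1, plogis(-0.5 + 1.0 * dat$x1 - 0.7 * dat$x2))

# Single fit on the full data set
fit <- glm(y ~ x1 + x2, family = binomial, data = dat)

# Refit on B resamples of the rows, keeping the coefficients each time
B <- 500
boot_coefs <- t(replicate(B, {
  idx <- sample(nrow(dat), replace = TRUE)
  coef(glm(y ~ x1 + x2, family = binomial, data = dat[idx, ]))
}))

coef(fit)                                    # estimates from the full data
colMeans(boot_coefs)                         # usually close to coef(fit)
apply(boot_coefs, 2, sd)                     # bootstrap standard errors
apply(boot_coefs, 2, quantile, probs = c(0.025, 0.975))  # percentile CIs
```

Each row of `boot_coefs` holds the coefficients from one resample, so its columns give the bootstrap distribution of each coefficient. The `boot` package offers the same idea through `boot()` and `boot.ci()`.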

Michal J. Figurski

Gustaf Rydevik wrote:

> " For example, in here, the statistical estimator is  the sample mean.
> Using bootstrap sampling, you can do beyond your statistical
> estimators. You can now get even the distribution of your estimator
> and the statistics (such as confidence interval, variance) of your
> estimator."
> Again you are misinterpreting text. The phrase about "doing beyond
> your statistical estimators", is explained in the next sentence, where
> he says that using bootstrap gives you information about the mean
> *estimator* (and not more information about the population mean).
> And since you're not interested in this information, in your case
> bootstrap/resampling is not useful at all.
> As another example of misinterpretation: In your email from a week
> ago, it sounds like you believe that the authors of the original paper
> are trying to improve on a fixed model.
> Figurski:
> "Regarding the "multiple stepwise regression" - according to the cited
> SPSS manual, there are 5 options to select from. I don't think they used
> 'stepwise selection' option, because their models were already
> pre-defined. Variables were pre-selected based on knowledge of
> pharmacokinetics of this drug and other factors. I think this part I
> understand pretty well."
> This paragraph is wrong. Sorry, no way around it.
> Quoting from the paper by Pawinski et al.:
> "*Twenty-six (!)* 1-, 2-, or 3-sample estimation
> models were fit (r2 = 0.341-0.862) to a randomly
> selected subset of the profiles using linear regression
> and were used to estimate AUC0-12h for the profiles not
> included in the regression fit, comparing those estimates
> with the corresponding AUC0-12h values, calculated
> with the linear trapezoidal rule, including all 12
> timed MPA concentrations. The 3-sample models were
> constrained to include no samples past 2 h."
> (emph. mine)
> They clearly state that they are choosing among 26 different models by
> using their bootstrap-like procedure, not improving on a single,
> predefined model.
> This procedure is statistically sound (more or less at least), and not
> controversial.
> However, (again) what you are wanting to do is *not* what they did in
> their paper!
> Resampling cannot improve on the performance of a pre-specified
> model. This is intuitively obvious, but moreover it's mathematically
> provable! That's why we're so certain of our standpoint. If you really
> wish, I (or someone else) could write out a proof, but I'm unsure if
> you would be able to follow.
> In the end, it doesn't really matter. What you are doing amounts to
> doing a regression 50 times, when once would suffice. No big harm
> done, just a bit of unnecessary work. And proof to a statistically
> competent reviewer that you don't really understand what you're doing.
> The better option would be to either study some more statistics
> yourself, or find a statistician that can do your analysis for you,
> and trust him to do it right.
> Anyhow, good luck with your research.
> Best regards,
> Gustaf
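
The model-comparison procedure described in the quoted passage (fit each candidate model to a random subset, predict the held-out profiles, repeat) can be sketched in base R as below. This is only an illustration under stated assumptions: the data are simulated, and the names `c1`, `c2`, `c3`, `auc`, and the three candidate formulas are hypothetical stand-ins, not the paper's 26 models.

```r
set.seed(7)

# Hypothetical profiles: three timed concentrations and a reference AUC
n <- 60
dat <- data.frame(c1 = rnorm(n, 10, 2), c2 = rnorm(n, 6, 1.5), c3 = rnorm(n, 3, 1))
dat$auc <- 5 + 2 * dat$c1 + 3 * dat$c2 + rnorm(n, sd = 2)  # c3 is pure noise here

# Candidate 1-, 2-, and 3-sample linear models (illustrative only)
models <- list(m1 = auc ~ c1, m2 = auc ~ c1 + c2, m3 = auc ~ c1 + c2 + c3)

# For each model: fit on a random half, score on the other half, 50 times
reps <- 50
rmse <- sapply(models, function(f) {
  mean(replicate(reps, {
    train <- sample(n, size = n / 2)
    fit <- lm(f, data = dat[train, ])
    pred <- predict(fit, newdata = dat[-train, ])
    sqrt(mean((dat$auc[-train] - pred)^2))   # held-out prediction error
  }))
})

rmse   # average held-out RMSE per candidate model
```

The repeated splits are used here to compare candidate models by held-out error. Once a model is chosen, its coefficients would normally come from a single fit to all the data.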
