[R] [Fwd: Re: Coefficients of Logistic Regression from bootstrap - how to get them?]

Frank E Harrell Jr f.harrell at vanderbilt.edu
Thu Jul 24 20:38:49 CEST 2008

Michal Figurski wrote:
> Thank you Frank and all for your advice.
> Here I attach the raw data from Pawinski's paper. I have obtained
> permission from the corresponding Author to post it here for everyone.
> The only condition of use is that the Authors retain ownership of the
> data, and any publication resulting from these data must be managed by 
> them.
> The dataset is composed as follows: patient number / MMF dose in [g] /
> Day of study (since start of drug administration) / MPA concentrations
> [mg/L] in plasma at following time points: 0, 0.5 ... 12 hours / and the
> value of AUC(0-12h) calculated using all time-points.
> The goal of the analysis, as you can read from the paper, was to
> estimate the value of AUC using maximum 3 time-points within 2 hours
> post dose, that is using only 3 of the 4 time-points: 0, 0.5, 1, 2 - but
> always include the "0" time-point.
> In my analysis of a similar problem I was also concerned about the fact
> that data come from several visits of a single patient. I have examined
> the effect of "PT" with repeated "day" using mixed effects model, and
> these effects turned out to be insignificant. Do you guys think it is
> enough justification to use the dataset as if coming from 50 separate
> patients?

I don't think that is the way to assess it.  Rather, an estimate of the 
intra-subject correlation should be used.  Or compare the variances from 
the cluster bootstrap and the ordinary bootstrap.
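
As a rough illustration of the second suggestion, here is a minimal Python sketch (not taken from the paper or from this thread) that compares the variance of a bootstrapped mean under ordinary and cluster resampling. The data are simulated and purely hypothetical: 21 patients contributing 50 profiles, with a deliberately large patient effect. The names rows, by_patient and variance_ratio are made up for the example.

```python
import random
import statistics

random.seed(1)

# Hypothetical data: 21 patients, 50 profiles in total (8 patients with 3
# profiles, 13 with 2). A large patient effect induces intra-subject
# correlation between profiles of the same patient.
rows = []  # (patient_id, value)
for pid in range(21):
    patient_effect = random.gauss(0, 10)
    for _ in range(3 if pid < 8 else 2):
        rows.append((pid, 50 + patient_effect + random.gauss(0, 1)))

# Group the values by patient so whole patients can be resampled.
by_patient = {}
for pid, y in rows:
    by_patient.setdefault(pid, []).append(y)
patients = list(by_patient)


def ordinary_bootstrap_means(n_boot):
    """Resample individual profiles, ignoring which patient they came from."""
    values = [y for _, y in rows]
    return [
        statistics.fmean(random.choices(values, k=len(values)))
        for _ in range(n_boot)
    ]


def cluster_bootstrap_means(n_boot):
    """Resample patients with replacement and keep all of their profiles."""
    means = []
    for _ in range(n_boot):
        sample = []
        for pid in random.choices(patients, k=len(patients)):
            sample.extend(by_patient[pid])
        means.append(statistics.fmean(sample))
    return means


var_ordinary = statistics.variance(ordinary_bootstrap_means(2000))
var_cluster = statistics.variance(cluster_bootstrap_means(2000))
variance_ratio = var_cluster / var_ordinary
print(var_ordinary, var_cluster, variance_ratio)
```

A ratio well above 1 suggests that treating the profiles as independent understates the variability of the estimate. A ratio near 1 suggests the clustering matters little for this particular statistic.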

> Also, as to estimation of the bias, variance, etc, Pawinski used CI and
> Sy/x. In my analysis I additionally used RMSE values. Please excuse
> another naive question, but: do you think it is sufficient information
> to compare between models and account for bias?

RMSE is usually a good approach.
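
To make the link between RMSE and bias concrete, here is a small Python sketch with made-up numbers (the observed and predicted values are hypothetical, not from the dataset). It relies only on the standard identity that the mean squared error equals the squared mean error (bias) plus the variance of the errors, so RMSE reflects both.

```python
import math

# Hypothetical observed and predicted AUC values, for illustration only.
observed = [20.0, 30.0, 40.0, 50.0, 60.0]
predicted = [22.0, 33.0, 41.0, 54.0, 60.0]

errors = [p - o for o, p in zip(observed, predicted)]
n = len(errors)

bias = sum(errors) / n                                # mean error
mse = sum(e ** 2 for e in errors) / n                 # mean squared error
rmse = math.sqrt(mse)
error_var = sum((e - bias) ** 2 for e in errors) / n  # divides by n, not n - 1

# MSE decomposes into squared bias plus error variance.
print(bias, rmse, bias ** 2 + error_var)
```

Reporting the mean error alongside RMSE shows how much of the total error is systematic rather than random.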

> Regarding the "multiple stepwise regression" - according to the cited
> SPSS manual, there are 5 options to select from. I don't think they used
> 'stepwise selection' option, because their models were already
> pre-defined. Variables were pre-selected based on knowledge of
> pharmacokinetics of this drug and other factors. I think this part I
> understand pretty well.
> I see Frank's point about recalibration on Fig.2 - although the
> expectation was set that the prediction be within 15% of the original
> value. In my opinion it is *very strict* - I actually used 20% in my
> work. This is because of very high variability and imprecision in the
> results themselves. These are real biological data and you have to
> account for errors like analytical errors (HPLC method), timing errors
> and so on, when you look at these data. In other words, if you take two
> blood samples at each time-point from a particular patient, and run
> them, you will 100% certainly get two distinct (although similar)
> profiles. You will get even more difference, if you run one set of
> samples on one day, and another set on second day.
> Therefore the value of AUC(0-12) itself, to which we compare the
> predicted AUC, is not 'holy' - some variability here is inherent.
> Nevertheless, I see that the Fig.2 may be incorrect, if we look from
> orthodox statistical perspective. I used the same plots in my work as
> well - it's too late now. How should I properly estimate the Rsq then?

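Validation Rsq is 1 - (sum of squared errors) / (total sum of squares), 
where the errors are observed minus predicted values taken as they are, 
with no new slope or intercept estimated in the test data.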
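
A minimal Python sketch of the difference, using made-up numbers (hypothetical values, not from the paper): the predictions below are perfectly correlated with the observations but badly calibrated, so the squared correlation is 1 while the validation Rsq is far lower. The function names are chosen for the example only.

```python
def validation_r2(observed, predicted):
    """1 - SSE/SST, using the predictions as-is (no recalibration)."""
    mean_obs = sum(observed) / len(observed)
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    sst = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - sse / sst


def squared_correlation(observed, predicted):
    """Squared Pearson correlation; unchanged by any linear rescaling of the
    predictions, which is what allows an after-the-fact recalibration."""
    n = len(observed)
    mean_obs = sum(observed) / n
    mean_pred = sum(predicted) / n
    cov = sum((o - mean_obs) * (p - mean_pred) for o, p in zip(observed, predicted))
    ss_obs = sum((o - mean_obs) ** 2 for o in observed)
    ss_pred = sum((p - mean_pred) ** 2 for p in predicted)
    return cov ** 2 / (ss_obs * ss_pred)


# Hypothetical example: predictions follow 0.5 * observed + 10.
observed = [20.0, 30.0, 40.0, 50.0, 60.0]
predicted = [0.5 * o + 10 for o in observed]

r2_validation = validation_r2(observed, predicted)
r2_correlation = squared_correlation(observed, predicted)
print(r2_validation, r2_correlation)  # 0.25 1.0
```

Note that the validation Rsq can even be negative when the predictions are worse than simply using the mean of the observed values.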


> I greatly appreciate your time and advice in this matter.
> -- 
> Michal J. Figurski
> Frank E Harrell Jr wrote:
>> Gustaf Rydevik wrote:
>>> On Wed, Jul 23, 2008 at 4:08 PM, Michal Figurski
>>> <figurski at mail.med.upenn.edu> wrote:
>>> Hi,
>>> I believe that you misunderstand the passage. Do you know what
>>> multiple stepwise regression is?
>>> Since they used SPSS, I copied from
>>> http://www.visualstatistics.net/SPSS%20workbook/stepwise_multiple_regression.htm 
>>> "Stepwise selection is a combination of forward and backward procedures.
>>> Step 1
>>> The first predictor variable is selected in the same way as in forward
>>> selection. If the probability associated with the test of significance
>>> is less than or equal to the default .05, the predictor variable with
>>> the largest correlation with the criterion variable enters the
>>> equation first.
>>> Step 2
>>> The second variable is selected based on the highest partial
>>> correlation. If it can pass the entry requirement (PIN=.05), it also
>>> enters the equation.
>>> Step 3
>>> From this point, stepwise selection differs from forward selection:
>>> the variables already in the equation are examined for removal
>>> according to the removal criterion (POUT=.10) as in backward
>>> elimination.
>>> Step 4
>>> Variables not in the equation are examined for entry. Variable
>>> selection ends when no more variables meet entry and removal criteria.
>>> -----------
>>> It is the outcome of this *entire process*, steps 1-4, that they compare
>>> with the outcome of their *entire bootstrap/crossvalidation/selection
>>> process*, Steps 1-4 in the methods section, and find that their approach
>>> gives a better result.
>>> What you are doing is only step 4 in the article's method
>>> section, estimating the parameters of a model *when you already know
>>> which variables to include*. It is the way this step is conducted that
>>> I am sceptical about.
>>> Regards,
>>> Gustaf
>> Perfectly stated Gustaf.  This is a great example of needing to truly 
>> understand a method to be able to use it in the right context.
>> After having read most of the paper by Pawinski et al now, there are 
>> other problems.
>> 1. The paper nowhere uses bootstrapping.  It uses repeated 2-fold 
>> cross-validation, a procedure not usually recommended.
>> 2. The resampling procedure used in the paper treated the 50 
>> pharmacokinetic profiles on 21 renal transplant patients as if these 
>> were from 50 patients.  The cluster bootstrap should have been used 
>> instead.
>> 3. Figure 2 showed the fitted regression line to the predicted vs. 
>> observed AUCs.  It should have shown the line of identity instead.  In 
>> other words, the authors allowed a subtle recalibration to creep into 
>> the analysis (and inverted the x- and y-variables in the plots).  The 
>> fitted lines are far enough away from the line of identity as to show 
>> that the predicted values are not well calibrated.  The r^2 values 
>> claimed by the authors used the wrong formulas which allowed an 
>> automatic after-the-fact recalibration (new overall slope and 
>> intercept are estimated in the test dataset).  Hence the achieved r^2 
>> are misleading.

Frank E Harrell Jr   Professor and Chair           School of Medicine
                      Department of Biostatistics   Vanderbilt University