[R] subset selection for logistic regression
Frank E Harrell Jr
f.harrell at vanderbilt.edu
Thu Mar 3 00:23:49 CET 2005
Christian Hennig wrote:
> Perhaps I should not write it because I will discredit myself with this
> but...
>
> Suppose I have a setup with 100 variables and some 1000 cases and I want to
> boil down the number of variables to a maximum of 10 for practical reasons
> even if I lose 10% prediction quality by this (for example because it is
> expensive to measure all variables on new cases).
>
> Is it really so wrong to use a stepwise method?
Yes. Read about model uncertainty and bias in models developed using
stepwise methods. One exception: if there is a large number of
variables with truly zero regression coefficients, and the rest are not
very weak, stepwise can sort things out fairly well. But you never know
this in advance.
> Let's say I divide the sample into three parts and do variable selection on
> the first part, estimation on the second and test on the third part (this
> solves almost all problems Frank is talking about on p. 56/57 in his
> excellent book). Is there always a tractable alternative?
That's a good way to find out how bad the method is, not to fix the
problems inherent in it.
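
A minimal sketch of the three-way split described in the question, in Python.
The function name three_way_split and the seed argument are hypothetical, and
the sketch only partitions case indices; it is an illustration, not code from
this thread. Variable selection would see only the first part, coefficient
estimation only the second, and the third would be held back for testing.

```python
import random

def three_way_split(n_cases, seed=0):
    """Randomly partition case indices into selection, estimation and test parts."""
    idx = list(range(n_cases))
    random.Random(seed).shuffle(idx)   # reproducible shuffle of the case indices
    third = n_cases // 3
    # The last part absorbs any remainder when n_cases is not divisible by 3.
    return idx[:third], idx[third:2 * third], idx[2 * third:]

# 1000 cases, as in the setup described in the question.
select_idx, estimate_idx, test_idx = three_way_split(1000)
print(len(select_idx), len(estimate_idx), len(test_idx))
```

Note that each stage then works with only about a third of the cases, which
is part of the price of this design.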
>
> Of course it is wrong to interpret the selected variables as "the true
> influences" and all others as "unrelated", but if I don't do that?
>
> If it should really be a taboo to do stepwise variable selection, why are p.
> 58/59 of "Regression Modeling Strategies" devoted to "how to do it if you
> must"?
Stress on "if". And note that if you ask what is the optimum alpha for
variables to be kept in the model when doing backwards stepdown, it's
alpha=1.0. A good compromise is alpha=0.5. See
@Article{ste01pro,
  author =  {Steyerberg, Ewout W. and Eijkemans, Marinus J. C. and
             Harrell, Frank E. and Habbema, J. Dik F.},
  title =   {Prognostic modeling with logistic regression analysis:
             {In} search of a sensible strategy in small data sets},
  journal = {Medical Decision Making},
  year =    2001,
  volume =  21,
  pages =   {45-56},
  annote =  {shrinkage; variable selection; dichotomization of
             continuous variables; sign of regression coefficient;
             calibration; validation}
}
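
To make the alpha values concrete, here is a small Python sketch that
converts a retention level alpha into the |z| a coefficient must exceed to
stay in the model. It assumes each variable is judged by a 1 d.f. Wald test
with a normal approximation; that assumption and the function name
wald_cutoff are illustrative and not taken from the paper.

```python
from statistics import NormalDist

def wald_cutoff(alpha):
    """Two-sided |z| cutoff for a 1 d.f. Wald test at level alpha (normal approximation)."""
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)

# Conventional 0.05, the compromise 0.5, and the "keep everything" level 1.0.
for alpha in (0.05, 0.5, 1.0):
    print(f"alpha={alpha}: keep a variable if |z| > {wald_cutoff(alpha):.3f}")
```

Under these assumptions the cutoffs are roughly 1.96, 0.67 and 0, so
alpha=1.0 drops nothing and alpha=0.5 drops only variables with very weak
apparent effects.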
And on Bert's excellent question about why shrinkage is not used more
often, here is our attempt at a remedy:
@Article{moo04pen,
  author =  {Moons, K. G. M. and Donders, A. Rogier T. and
             Steyerberg, E. W. and Harrell, F. E.},
  title =   {Penalized maximum likelihood estimation to directly
             adjust diagnostic and prognostic prediction models for
             overoptimism: a clinical example},
  journal = {J Clinical Epidemiology},
  year =    2004,
  volume =  57,
  pages =   {1262-1270},
  annote =  {prediction research; overoptimism; overfitting;
             penalization; bootstrapping; shrinkage}
}
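
As a rough illustration of the shrinkage idea, the following Python sketch
fits a logistic model by maximizing loglik - 0.5 * penalty * sum(beta^2) with
plain gradient ascent. The quadratic penalty form, the unpenalized intercept,
the simulated data, the penalty value of 20 and all names are assumptions
made for illustration; this is not the procedure or the data from the paper.

```python
import math
import random

def sigmoid(t):
    # Numerically stable logistic function.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def fit_penalized_logistic(X, y, penalty=0.0, n_iter=500, step=0.5):
    """Gradient ascent on loglik - 0.5 * penalty * sum(beta_j ** 2).
    The intercept is left unpenalized (an assumption of this sketch)."""
    n, p = len(X), len(X[0])
    intercept, beta = 0.0, [0.0] * p
    for _ in range(n_iter):
        g0, g = 0.0, [0.0] * p          # gradient of the log-likelihood
        for xi, yi in zip(X, y):
            eta = intercept + sum(b * x for b, x in zip(beta, xi))
            resid = yi - sigmoid(eta)
            g0 += resid
            for j in range(p):
                g[j] += resid * xi[j]
        intercept += step * g0 / n
        # The penalty pulls each slope toward zero in proportion to its size.
        beta = [b + step * (gj - penalty * b) / n for b, gj in zip(beta, g)]
    return intercept, beta

# Simulated data (hypothetical): 200 cases, 5 predictors, 3 with zero effect.
rng = random.Random(1)
true_beta = [1.0, -0.5, 0.0, 0.0, 0.0]
X = [[rng.gauss(0.0, 1.0) for _ in true_beta] for _ in range(200)]
y = [1 if rng.random() < sigmoid(sum(b * x for b, x in zip(true_beta, xi))) else 0
     for xi in X]

_, beta_ml = fit_penalized_logistic(X, y, penalty=0.0)
_, beta_pen = fit_penalized_logistic(X, y, penalty=20.0)
print("unpenalized:", [round(b, 2) for b in beta_ml])
print("penalized:  ", [round(b, 2) for b in beta_pen])
```

All five predictors stay in the model; the penalty shrinks their
coefficients toward zero instead of deleting any of them. In practice the
penalty would have to be chosen from the data rather than fixed in advance
as it is here.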
Frank
>
> Please forget my name;-)
>
> Christian
>
> On Wed, 2 Mar 2005, Berton Gunter wrote:
>
>
>>To clarify Frank's remark ...
>>
>>A prominent theme in statistical research over at least the last 25 years
>>(with roots that go back 50 or more, probably) has been the superiority of
>>"shrinkage" methods over variable selection. I also find it distressing that
>>these ideas have apparently not penetrated much (at all?) into the wider
>>scientific community (but I suppose I shouldn't be surprised -- most
>>scientists still do one factor at a time experiments 80 years after Fisher).
>>Specific incarnations can be found in anything Bayesian, mixed effects
>>models for repeated measures, ridge regression, and the R packages lars and
>>lasso, among others.
>>
>>I would speculate that aside from the usual statistics/science cultural
>>issues, part of the reason for this is that the estimators don't generally
>>come with neat, classical inference procedures: like it or not, many
>>scientists have been conditioned by their Stat 101 courses to expect P
>>values, so in some sense, we are hoist by our own petard.
>>
>>Just my $.02 -- contrary (and more knowledgeable) opinions welcome.
>>
>>-- Bert Gunter
>>
>>
>>
>>>-----Original Message-----
>>>From: r-help-bounces at stat.math.ethz.ch
>>>[mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Frank
>>>E Harrell Jr
>>>Sent: Wednesday, March 02, 2005 5:13 AM
>>>To: Wittner, Ben
>>>Cc: r-help at lists.R-project.org
>>>Subject: Re: [R] subset selection for logistic regression
>>>
>>>Wittner, Ben wrote:
>>>
>>>>R-packages leaps and subselect implement various methods of selecting
>>>>best or good subsets of predictor variables for linear regression
>>>>models, but they do not seem to be applicable to logistic regression
>>>>models.
>>>>
>>>>Does anyone know of software for finding good subsets of predictor
>>>>variables for logistic regression models?
>>>>
>>>>Thanks.
>>>>
>>>>-Ben
>>>
>>>Why are these procedures still being used? The performance is known
>>>to be bad in almost every sense (see r-help archives).
>>>
>>>Frank Harrell
>>>
>>>
>>>>
>>>>p.s., The leaps package references "Subset Selection in Regression" by
>>>>Alan Miller. On page 2 of the 2nd edition of that text it states the
>>>>following:
>>>>
>>>>  "All of the models which will be considered in this monograph will be
>>>>  linear; that is they will be linear in the regression coefficients.
>>>>  Though most of the ideas and problems carry over to the fitting of
>>>>  nonlinear models and generalized linear models (particularly the
>>>>  fitting of logistic relationships), the complexity is greatly
>>>>  increased."
>>>
>>>
>>>--
>>>Frank E Harrell Jr Professor and Chair School of Medicine
>>> Department of Biostatistics
>>>Vanderbilt University
>>>
>>>______________________________________________
>>>R-help at stat.math.ethz.ch mailing list
>>>https://stat.ethz.ch/mailman/listinfo/r-help
>>>PLEASE do read the posting guide!
>>>http://www.R-project.org/posting-guide.html
>>>
>>
>
>
> ***********************************************************************
> Christian Hennig
> Fachbereich Mathematik-SPST/ZMS, Universitaet Hamburg
> hennig at math.uni-hamburg.de, http://www.math.uni-hamburg.de/home/hennig/
> From 1 April 2005: Department of Statistical Science, UCL, London
> #######################################################################
> I recommend www.boag-online.de
>
>
--
Frank E Harrell Jr Professor and Chair School of Medicine
Department of Biostatistics Vanderbilt University