[R] Stepwise model selection question

John Thaden jjthaden at flash.net
Fri Jun 18 23:21:05 CEST 1999

     I use the step() function occasionally, and I think I understand its
objective, proper use, and limitations.  Now I see stepwise model selection
being used in what seems to be an unusual way, and I wonder if it is right
or wrong.  May I describe?

     Genetic mapping tries to locate the genetic elements in an animal's
genome that influence a particular physical trait.  Say there are 100
individuals derived from a cross between two parental strains, and for each
individual, we know which parent contributed each of 40 different locations
on the genome (mapping markers), and we have measured the trait.  The
markers are then binary predictors, the trait is the outcome.
     First, we could do 'single-marker analysis'.  We'd run a simple F-test
of the trait against each marker separately, and rank the markers by
F-value.  A high value suggests the mapping marker lies near a genetic
element affecting the trait.
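     A minimal sketch of the single-marker ranking in R may make this
concrete.  Everything in it is assumed for illustration: the data are
simulated, the marker names (m1 ... m40) are hypothetical, only marker m7 is
given a real effect, and the markers are drawn independently, whereas real
markers on the same chromosome would be correlated through linkage.

```r
# Simulated stand-in for the data described above (all names hypothetical):
# 100 individuals, 40 binary markers, one quantitative trait.
set.seed(1)
n <- 100
p <- 40
markers <- as.data.frame(matrix(rbinom(n * p, 1, 0.5), n, p))
names(markers) <- paste0("m", 1:p)

# Assumption for illustration: only marker m7 truly affects the trait.
trait <- 2 * markers$m7 + rnorm(n)

# One F-test per marker: regress the trait on that marker alone
# and pull the F statistic out of the ANOVA table.
f_values <- sapply(names(markers), function(m) {
  fit <- lm(trait ~ markers[[m]])
  anova(fit)[1, "F value"]
})

# Rank markers by F-value, largest first.
head(sort(f_values, decreasing = TRUE))
```

Because each marker is tested on its own, these F-values are not adjusted
for the other markers or for having run 40 tests.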
     As a second analysis, we could use a method called interval mapping.
It uses the same data to "scan" the genome, including regions between
mapping markers, and produces plots with likelihood ratios or LOD scores on
the y-axis and position along the genome on the x-axis.  (It relies on
mixture regression models, since the parental contribution is unknown
between mapping markers.)
     As a third analysis, we could use a refinement called composite
interval mapping.  It includes a few of the 40 mapping markers as
additional cofactors in mixture regression formulae as one scans.  The idea
is to have cofactors to handle genetic elements with large effects while
scanning elsewhere in the genome.  The markers to include as cofactors are
selected before the scanning phase by stepwise multiple-regression model
selection (occasionally I can compare all possible models exhaustively, but
usually it is done with a forward-backward algorithm).
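     The cofactor-selection step could be sketched with step() roughly as
follows.  This is only an illustration under stated assumptions: the data
are simulated, the marker names and the two "true" effects (m7 and m21) are
hypothetical, and step() is left at its default AIC criterion, which may
differ from the F-to-enter rules used in mapping software.

```r
# Simulated stand-in data (all names hypothetical): 100 individuals,
# 40 binary markers, and a trait influenced by markers m7 and m21 only.
set.seed(2)
n <- 100
p <- 40
dat <- as.data.frame(matrix(rbinom(n * p, 1, 0.5), n, p))
names(dat) <- paste0("m", 1:p)
dat$trait <- 2 * dat$m7 + 1.5 * dat$m21 + rnorm(n)

# Start from the intercept-only model and let step() add or drop markers.
null_fit <- lm(trait ~ 1, data = dat)
upper <- reformulate(paste0("m", 1:p))   # one-sided: ~ m1 + m2 + ... + m40
sel <- step(null_fit,
            scope = list(lower = ~ 1, upper = upper),
            direction = "both",          # forward-backward search
            trace = 0)                   # suppress the step-by-step log

# Markers retained in the final model, i.e. the candidate cofactors.
cofactors <- attr(terms(sel), "term.labels")
cofactors

# F statistics for dropping each retained marker from the selected model.
drop1(sel, test = "F")
```

Note that the F statistics printed by drop1() are computed after the
selection has already been made on the same data.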
     I'm OK with this so far.  The use of step() seems fairly standard.
But now here's where I think it gets weird:  There is a compulsion among
geneticists to then treat the results of the stepwise model selection as
yet a fourth analytical tool by which to rank all the mapping markers,
i.e., as further evidence that a marker must be near a genetic element
affecting the trait.  "If it's included in the model, it must be close to a
genetic effector of the trait".  How does this sound to you?  If a stepwise
algorithm ranks possible cofactors--perhaps even assigns them an F
value--can you use that ranking to make any comparisons among possible
cofactors?  What do the F values mean?

John J. Thaden, Ph.D., Instructor        jjthaden at life.uams.edu
Department of Geriatrics                     (501) 257-5583
University of Arkansas for Medical Sciences  FAX: (501) 257-4822
      mail & ship to:	J. L. McClellan V.A. Medical Center
		Research-151 (Room GB103 or GC124)
		4300 West 7th Street
		Little Rock AR 72205 USA
