[R] Discretize continuous variables....

Frank E Harrell Jr f.harrell at vanderbilt.edu
Sun Jul 20 15:11:41 CEST 2008

Johannes Huesing wrote:
> Frank E Harrell Jr <f.harrell at vanderbilt.edu> [Sun, Jul 20, 2008 at 12:20:28AM CEST]:
>> Johannes Huesing wrote:
>>> Because regulatory bodies demand it? 
> [...]
>> And how anyway does this  
>> relate to predictors in a model?
> Not at all; you're correct. I was mixing the topic of this discussion
> up with another kind of silliness.
> I had a discussion with a biometrician in a pharmaceutical company
> though who stated that when you have only one df to spend it will be
> better to dichotomise it at a clinically meaningful point than to
> include it as a linear term. He kept the discussion on the ground of
> laboratory measurements like sodium, where a deviation from normal
> ranges is very significant (and unlike, say, cholesterol, where you
> have a gradual interpretation of the value). He has a point there, but
> in general the reason for sacrificing information is a mixture of
> laziness, a preference for presenting data in tables, and a wish to
> keep the modelling "consistent" with the tables (for instance, to
> assign an odds ratio to each cell).

Nice points.  I think the desire to be able to present things in tables 
is a major reason.

The biometrician's idea that a piecewise flat line with one jump will 
fit a dataset better than a linear effect is quite a leap in logic.  If 
I only have one d.f. to spend I'll take linear any day, but better to 
spend a little more and fit a smooth nonlinear relationship.  A coherent 
approach is to shrink the fit down to the effective number of parameters 
the dataset will support estimating.
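The information loss from dichotomizing is easy to demonstrate with a quick simulation (a sketch in Python/numpy rather than R; the variable names are mine, but the arithmetic is the same in any language). Even when the true effect is exactly linear, so that the single linear df is the right one to spend, splitting at the median discards roughly a fifth of the predictor's correlation with the outcome:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.standard_normal(n)        # continuous predictor
y = x + rng.standard_normal(n)    # outcome with a truly linear effect

# Spend the single df two ways: keep x linear, or dichotomize at the median.
x_binary = (x > np.median(x)).astype(float)

r_linear = np.corrcoef(x, y)[0, 1]   # ~0.71
r_binary = np.corrcoef(x_binary, y)[0, 1]   # ~0.56 -- same df, less information

print(round(r_linear, 2), round(r_binary, 2))
```

For a normally distributed predictor split at its mean, theory gives the same answer: the correlation shrinks by the factor sqrt(2/pi), about 0.80, before any nonlinearity is even considered.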

There is no clinical laboratory measure that has a jump discontinuity in 
its effect on mortality or other patient outcomes.  The fact that 
reference ranges exist (which are based only on supposedly normal 
subjects and don't relate to the risk of an outcome) doesn't mean we 
should use them when formulating independent or dependent variables.

It is common but distorted logic to want an odds ratio in a model to be 
comparable to one computed from a table, with the regression 
coefficients simply anti-logged (so that 1-unit changes can be used).  
The tabled odds ratio is a kind of crude population-averaged odds ratio 
that may not apply to a single subject in the study.
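The gap between a population-averaged (tabled) odds ratio and a subject-level one is not a sampling artifact; it is the non-collapsibility of the odds ratio, and two lines of arithmetic show it (a hypothetical sketch in Python; the stratum risks are invented for illustration):

```python
import numpy as np

# Two equal-size strata; within EACH stratum the exposure multiplies
# the odds of the outcome by exactly 3 (the subject-level odds ratio).
p0 = np.array([0.1, 0.5])             # baseline risk per stratum
odds1 = 3 * p0 / (1 - p0)             # exposed odds, conditional OR = 3
p1 = odds1 / (1 + odds1)

# A 2x2 table collapses the strata first, then forms the odds ratio.
p0_bar, p1_bar = p0.mean(), p1.mean()
or_marginal = (p1_bar / (1 - p1_bar)) / (p0_bar / (1 - p0_bar))

print(round(or_marginal, 2))          # ~2.33, not 3
```

The crude table odds ratio (about 2.33 here) is smaller than the odds ratio that applies to every single subject, even with no confounding at all.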

My book has many examples where laboratory measurements are related to 
risk using restricted cubic splines.
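For readers who want to see the mechanics, here is a minimal sketch of one common truncated-power form of the restricted cubic spline basis, written in Python/numpy for self-containment (in R, `rms::rcs` and `Hmisc::rcspline.eval` compute this directly, with an extra normalization of the nonlinear terms; the function name `rcs_basis` is mine):

```python
import numpy as np

def rcs_basis(x, knots):
    """Restricted cubic spline basis, truncated-power form.

    Returns x itself plus k-2 nonlinear columns for k knots, so a model
    using this basis spends k-1 df and is constrained to be linear
    beyond the outer knots.
    """
    x = np.asarray(x, dtype=float)
    t = np.asarray(knots, dtype=float)
    k = len(t)
    d = t[k - 1] - t[k - 2]           # spacing of the last two knots
    cols = [x]
    for j in range(k - 2):
        # Cubic term with corrections that cancel the cubic and
        # quadratic parts beyond the last knot.
        term = (np.maximum(x - t[j], 0) ** 3
                - np.maximum(x - t[k - 2], 0) ** 3 * (t[k - 1] - t[j]) / d
                + np.maximum(x - t[k - 1], 0) ** 3 * (t[k - 2] - t[j]) / d)
        cols.append(term)
    return np.column_stack(cols)

# Four knots -> 3 df: one linear column plus two nonlinear columns.
B = rcs_basis(np.linspace(0, 10, 5), [1.0, 4.0, 7.0, 9.0])
print(B.shape)  # (5, 3)
```

A quick check of the "restricted" part: below the first knot the nonlinear columns vanish (the basis is exactly x), and beyond the last knot every column is linear in x.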


Frank E Harrell Jr   Professor and Chair           School of Medicine
                      Department of Biostatistics   Vanderbilt University
