[R] Linear regression with a rounded response variable

Wed Oct 21 19:57:14 CEST 2015

On Wed, 21 Oct 2015, Ravi Varadhan wrote:

> Hi, I am dealing with a regression problem where the response variable, 
> time (seconds) to walk 15 ft, is rounded to the nearest integer.  I do 
> not care for the regression coefficients per se, but my main interest is 
> in getting the prediction equation for walking speed, given the 
> predictors (age, height, sex, etc.), where the predictions will be real 
> numbers, and not integers.  The hope is that these predictions should 
> provide unbiased estimates of the "unrounded" walking speed. This 
> sounds like a measurement error problem, where the measurement error is 
> due to rounding and hence would be uniformly distributed (-0.5, 0.5).
>

This is not the usual "measurement error model" problem, though. There the 
errors are in X and are not independent of XB.

Look back at the proof of the unbiasedness of least squares under the 
Gauss-Markov setup. The errors in Y need to have expectation zero.

From your description (but see caveat below) this is true of walking 
*time*, but not exactly true of walking *speed* (modulo the usual 
assumptions if they apply to time). In fact if E(epsilon) = 0 were true of 
unrounded time, it would not be true of unrounded speed (and vice versa).
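
A small numerical sketch of the time-versus-speed point. It takes the 
poster's premise at face value, namely that the true time is uniform 
within half a second of the recorded integer. The 15 ft distance comes 
from the question; the function name and the range of recorded values are 
made up for illustration.

```python
import math

DISTANCE_FT = 15.0  # distance walked, from the original question


def mean_errors(k):
    """Mean rounding error in time and in speed for recorded time k.

    Assumption (the poster's premise): the true time T is uniform on
    [k - 0.5, k + 0.5) and is recorded as the integer k.
    """
    lo, hi = k - 0.5, k + 0.5
    # Mean of (recorded - true) time: k minus the midpoint of the interval.
    mean_time_err = k - (lo + hi) / 2.0
    # E[1/T] for T uniform on an interval of width 1 is log(hi / lo).
    mean_true_speed = DISTANCE_FT * math.log(hi / lo)
    # Mean of (speed from recorded time - true speed), in ft/s.
    mean_speed_err = DISTANCE_FT / k - mean_true_speed
    return mean_time_err, mean_speed_err


for k in range(3, 9):
    t_err, s_err = mean_errors(k)
    print(f"recorded {k} s: mean time error {t_err:+.4f}, "
          f"mean speed error {s_err:+.4f} ft/s")
```

Under this premise the mean error in time is zero for every recorded 
value, while the mean error in speed is slightly negative, because 1/T is 
convex in T (Jensen's inequality).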

> Are there any canonical approaches for handling this type of a problem?

Work out the bias analytically? Parametric bootstrap? Data augmentation 
and friends?
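
For the parametric bootstrap, something along these lines might do as a 
starting point. It is only a sketch under loud assumptions: a single 
predictor (age), a straight-line model for unrounded time with normal 
errors, and made-up data and parameter values. The functions `ols`, 
`round_half_up` and `bootstrap_bias` are hypothetical helpers written for 
this example.

```python
import math
import random


def ols(x, y):
    """Return (intercept, slope) of the least-squares line y ~ x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def round_half_up(t):
    """Round to the nearest integer, halves going up."""
    return math.floor(t + 0.5)


def bootstrap_bias(age, y_rounded, x0, n_boot=200, seed=1):
    """Parametric-bootstrap estimate of the bias in predicted time at x0.

    Assumption: unrounded time = b0 + b1 * age + Normal(0, sigma) noise,
    with b0, b1, sigma taken from the fit to the rounded data. The sigma
    estimate is crude because it also absorbs the rounding variance.
    """
    rng = random.Random(seed)
    n = len(age)
    b0, b1 = ols(age, y_rounded)
    rss = sum((yi - (b0 + b1 * ai)) ** 2 for ai, yi in zip(age, y_rounded))
    sigma = math.sqrt(rss / (n - 2))
    truth = b0 + b1 * x0  # mean unrounded time under the fitted model
    total = 0.0
    for _ in range(n_boot):
        # Simulate unrounded times, round them, and refit.
        y_star = [round_half_up(b0 + b1 * a + rng.gauss(0.0, sigma))
                  for a in age]
        c0, c1 = ols(age, y_star)
        total += (c0 + c1 * x0) - truth
    return total / n_boot


# Made-up data purely for illustration.
data_rng = random.Random(0)
age = [data_rng.uniform(60.0, 90.0) for _ in range(200)]
y = [round_half_up(2.0 + 0.05 * a + data_rng.gauss(0.0, 1.0)) for a in age]
bias = bootstrap_bias(age, y, x0=75.0)
print(f"estimated bias in predicted time at age 75: {bias:+.4f} s")
```

The same recipe applies to speed: simulate unrounded times, convert both 
the simulated truth and the rounded values to speed, and compare the 
refitted predictions with the simulated truth.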

> What is wrong with just doing the standard linear regression?
>

Well, what do the actual values look like?

If half the subjects have a value of 5 seconds and the rest are split 
between 4 and 6, your assertion that rounding induces an error of 
dunif(epsilon,-0.5,0.5) is surely wrong (more positive errors in the 6 
second group and more negative errors in the 4 second group under any 
plausible model).
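
To make that concrete, here is a sketch under one assumed model: true 
times normally distributed around 5 seconds, with the standard deviation 
(0.74, a value picked for illustration) chosen so that about half the 
subjects are recorded as 5. The error is taken as recorded minus true 
time.

```python
from statistics import NormalDist

# Assumption: true (unrounded) times ~ Normal(5, 0.74). The model and both
# parameters are illustrative, not estimated from any data.
true_time = NormalDist(mu=5.0, sigma=0.74)


def prob_recorded(k):
    """P(the true time rounds to the integer k)."""
    return true_time.cdf(k + 0.5) - true_time.cdf(k - 0.5)


def prob_error_positive(k):
    """P(recorded - true > 0 | recorded == k): true time in [k - 0.5, k)."""
    below = true_time.cdf(k) - true_time.cdf(k - 0.5)
    return below / prob_recorded(k)


for k in (4, 5, 6):
    print(f"recorded {k} s: share of subjects {prob_recorded(k):.3f}, "
          f"P(error > 0) {prob_error_positive(k):.3f}")
```

If the rounding error really were uniform on (-0.5, 0.5) within each 
group, P(error > 0) would be 0.5 for every recorded value. Under this 
model it is above 0.5 in the 6 second group and below 0.5 in the 4 second 
group.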

HTH,

Chuck