[R] A doubt about "lm" and the meaning of its summary
Alberto Monteiro
albmont at centroin.com.br
Mon Aug 18 20:31:17 CEST 2008
I have a conceptual problem (sort of). Maybe there's a simple solution,
maybe not.
First, let me explain the test case that goes ok.
Let x be a (fixed) n-dimensional vector. I simulate a large number of linear models,
y = m x + c + error, and regress each one. As expected, the
estimated values of m (let's call them m.bar), once centred at m and divided by
their estimated standard errors, follow a Student's t distribution.
More precisely (test case, it takes a few minutes to run):
#
# start with fixed numbers
#
m <- sample(c(-1, -0.1, 0.1, 1), 1)
c <- sample(c(-1, -0.1, 0, 0.1, 1), 1) # note: this shadows base::c, but calls to c() still find the function
sigma <- sample(c(0.1, 0.2, 0.5, 1), 1)
n <- sample(c(4,5,10,1000), 1)
x <- 1:n # it's possible to use other x
NN <- sample(c(3000, 4000, 5000), 1)
m.bar <- m.std.error <- numeric(NN) # preallocate the vectors that will hold the simulation output
#
# Now let's repeat NN times:
# simulate y
# regress y ~ x
# store m.bar and its error
#
for (i in 1:NN) {
  y <- m * x + c + sigma * rnorm(n)
  reg <- lm(y ~ x)
  cf <- summary(reg)$coefficients  # compute the summary only once per fit
  m.bar[i] <- cf['x', 'Estimate']
  m.std.error[i] <- cf['x', 'Std. Error']
}
#
# Finally, let's analyse it
# The distribution of (m.bar - m) / m.std.error is
# a Student's t with n - 2 degrees of freedom.
# Is it? Let's test!
#
print(ks.test((m.bar - m) / m.std.error, rt(NN, n - 2)))
Now, this works as expected: the value returned by ks.test (its p-value) is, over
repeated runs, approximately uniformly distributed on the interval (0, 1), as it
should be when the null distribution is correct.
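A small variant of the same check (just a sketch, reusing the m.bar and m.std.error
vectors computed above): ks.test() also accepts the name of a cumulative distribution
function, so one can compare against the t CDF directly and avoid the extra Monte
Carlo noise from the rt(NN, n - 2) sample:
print(ks.test((m.bar - m) / m.std.error, "pt", df = n - 2))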
Next, I tried a slightly different model. This one is a reversion model,
where the simulated random variable lp follows the recursion
lp[t + 1] <- (1 + m) * lp[t] + c + error
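As an aside, this recursion can also be simulated without an explicit loop; a sketch,
assuming m, c, sigma and n are set as in the test case below (stats::filter() with
method = "recursive" computes y[j] = x[j] + (1 + m) * y[j - 1]):
# simulate lp[t + 1] = (1 + m) * lp[t] + c + error, with lp[1] = 0, without a loop
innov <- c + sigma * rnorm(n)  # the "c + error" terms
lp <- c(0, filter(innov, 1 + m, method = "recursive"))
The test case below keeps the explicit loop, though.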
I tried to use the same method as above, defining
x = lp[1:n], y = lp[2:(n+1)] - lp[1:n], so that the regression equation is again
y <- m * x + c + error
And now it breaks. Test case (it takes some minutes to run):
#
# start with fixed numbers
#
m <- runif(1, -1, 0) # m must be something in the (-1,0) interval
c <- sample(c(-1, -0.1, 0, 0.1, 1), 1)
sigma <- sample(c(0.1, 0.2, 0.5, 1), 1)
n <- sample(c(4,5,10,1000), 1)
NN <- sample(c(3000, 4000, 5000), 1)
m.bar <- m.std.error <- numeric(NN) # preallocate the vectors that will hold the simulation output
#
# Now let's repeat NN times:
# simulate y
# regress y ~ x
# store m.bar and its error
#
for (i in 1:NN) {
  lp <- numeric(n + 1)  # lp[1] starts at 0
  for (j in 1:n) {
    lp[j + 1] <- (1 + m) * lp[j] + c + sigma * rnorm(1)
  }
  x <- lp[1:n]
  y <- lp[-1] - x
  reg <- lm(y ~ x)
  cf <- summary(reg)$coefficients  # compute the summary only once per fit
  m.bar[i] <- cf['x', 'Estimate']
  m.std.error[i] <- cf['x', 'Std. Error']
}
#
# Finally, let's analyse it
# The distribution of (m.bar - m) / m.std.error should be
# a Student's t with n - 2 degrees of freedom.
# Is it? Let's test!
#
print(ks.test((m.bar - m) / m.std.error, rt(NN, n - 2)))
... and now it's not.
What's wrong with this? I suspect that the model y ~ x only gives
meaningful values when x is deterministic, whereas here x is stochastic (it is made
of lagged values of the simulated series itself). Is there a correct way to estimate
this model that gives meaningful outputs?
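As a tentative cross-check (just a sketch, not necessarily the right answer), one
could also fit the AR(1) form lp[t + 1] = (1 + m) * lp[t] + c + error directly by
maximum likelihood with arima(); note that arima() reports the process mean as
"intercept", not c itself, so only the slope part is directly comparable:
# fit the AR(1) by maximum likelihood on one simulated series lp
fit <- arima(lp, order = c(1, 0, 0), method = "ML")
m.hat <- unname(fit$coef["ar1"]) - 1      # estimate of m (ar1 coefficient minus 1)
m.se <- sqrt(fit$var.coef["ar1", "ar1"])  # its asymptotic standard error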
Alberto Monteiro