[R] A doubt about "lm" and the meaning of its summary
Alberto Monteiro
albmont at centroin.com.br
Mon Aug 18 20:31:17 CEST 2008
I have a conceptual problem (sort of). Maybe there's a simple solution,
maybe not.
First, let me explain the test case that goes ok.
Let x be a (fixed) n-dimensional vector. I simulate a large number of linear models,
y = m x + c + error, and regress each one. As expected, the
estimated values of m (let's call them m.bar), once centred at m and divided by
their estimated standard errors, follow a Student's t distribution.
More precisely (test case, it takes a few minutes to run):
#
# start with fixed numbers
#
m <- sample(c(-1, -0.1, 0.1, 1), 1)
c <- sample(c(-1, -0.1, 0, 0.1, 1), 1) # note: this shadows base::c, but calls to c() still find the function
sigma <- sample(c(0.1, 0.2, 0.5, 1), 1)
n <- sample(c(4,5,10,1000), 1)
x <- 1:n # it's possible to use other x
NN <- sample(c(3000, 4000, 5000), 1)
m.bar <- m.std.error <- numeric(NN) # preallocate the vectors that will hold the simulation output
#
# Now let's repeat NN times:
# simulate y
# regress y ~ x
# store m.bar and its error
#
for (i in 1:NN) {
  y <- m * x + c + sigma * rnorm(n)
  reg <- lm(y ~ x)
  cf <- summary(reg)$coefficients  # compute the summary only once per fit
  m.bar[i] <- cf['x', 'Estimate']
  m.std.error[i] <- cf['x', 'Std. Error']
}
#
# Finally, let's analyse it
# The distribution of (m.bar - m) / m.std.error is
# a Student's t with n - 2 degrees of freedom.
# Is it? Let's test!
#
print(ks.test((m.bar - m) / m.std.error, rt(NN, n - 2)))
Now, this works as expected: the value returned by ks.test (its p-value) is, over
repeated runs, approximately uniformly distributed on the interval (0, 1), as it
should be when the null distribution is correct.
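A small variant of the same check (just a sketch, reusing the m.bar and m.std.error
vectors computed above): ks.test() also accepts the name of a cumulative distribution
function, so one can compare against the t CDF directly and avoid the extra Monte
Carlo noise from the rt(NN, n - 2) sample:
print(ks.test((m.bar - m) / m.std.error, "pt", df = n - 2))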
Next, I tried a slightly different model. This one is a reversion model,
where the simulated random variable lp follows the recursion
lp[t + 1] <- (1 + m) * lp[t] + c + error
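As an aside, this recursion can also be simulated without an explicit loop; a sketch,
assuming m, c, sigma and n are set as in the test case below (stats::filter() with
method = "recursive" computes y[j] = x[j] + (1 + m) * y[j - 1]):
# simulate lp[t + 1] = (1 + m) * lp[t] + c + error, with lp[1] = 0, without a loop
innov <- c + sigma * rnorm(n)  # the "c + error" terms
lp <- c(0, filter(innov, 1 + m, method = "recursive"))
The test case below keeps the explicit loop, though.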
I tried to use the same method as above, defining
x = lp[1:n], y = lp[2:(n+1)] - lp[1:n], so that the regression equation is again
y <- m * x + c + error
And now it breaks. Test case (it takes some minutes to run):
#
# start with fixed numbers
#
m <- runif(1, -1, 0) # m must be something in the (-1,0) interval
c <- sample(c(-1, -0.1, 0, 0.1, 1), 1)
sigma <- sample(c(0.1, 0.2, 0.5, 1), 1)
n <- sample(c(4,5,10,1000), 1)
NN <- sample(c(3000, 4000, 5000), 1)
m.bar <- m.std.error <- numeric(NN) # preallocate the vectors that will hold the simulation output
#
# Now let's repeat NN times:
# simulate y
# regress y ~ x
# store m.bar and its error
#
for (i in 1:NN) {
  lp <- numeric(n + 1)  # lp[1] starts at 0
  for (j in 1:n) {
    lp[j + 1] <- (1 + m) * lp[j] + c + sigma * rnorm(1)
  }
  x <- lp[1:n]
  y <- lp[-1] - x
  reg <- lm(y ~ x)
  cf <- summary(reg)$coefficients  # compute the summary only once per fit
  m.bar[i] <- cf['x', 'Estimate']
  m.std.error[i] <- cf['x', 'Std. Error']
}
#
# Finally, let's analyse it
# The distribution of (m.bar - m) / m.std.error should be
# a Student's t with n - 2 degrees of freedom.
# Is it? Let's test!
#
print(ks.test((m.bar - m) / m.std.error, rt(NN, n - 2)))
... and now it's not.
What's wrong with this? I suspect that the model y ~ x only gives
meaningful values when x is deterministic, whereas here x is stochastic (it is made
of lagged values of the simulated series itself). Is there a correct way to estimate
this model that gives meaningful outputs?
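As a tentative cross-check (just a sketch, not necessarily the right answer), one
could also fit the AR(1) form lp[t + 1] = (1 + m) * lp[t] + c + error directly by
maximum likelihood with arima(); note that arima() reports the process mean as
"intercept", not c itself, so only the slope part is directly comparable:
# fit the AR(1) by maximum likelihood on one simulated series lp
fit <- arima(lp, order = c(1, 0, 0), method = "ML")
m.hat <- unname(fit$coef["ar1"]) - 1      # estimate of m (ar1 coefficient minus 1)
m.se <- sqrt(fit$var.coef["ar1", "ar1"])  # its asymptotic standard error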
Alberto Monteiro