[R] predict.lm point forecasts with factors

Wed Feb 14 22:24:51 CET 2007

On Wed, 2007-02-14 at 13:54 -0700, sj wrote:
> hello,
> 
> I am trying to use predict.lm to make point forecasts based on a model with
> continuous and categorical independent variables
> I have no problems fitting the model using lm, but when I try to use predict
> to make point predictions. it reverts back to the original dataframe and
> gives me the point predictions for the fitted data rather than for the new
> data, I imagine that I am missing something simple but for whatever reason I
> can't figure out why it does not like the new data and is reverting to the
> fitted data. The following code illustrates the problem I am running in to.
> Any help would be appreciated.
> 
> f1 <- rep(c("a","b","c","d"),25)
> f2 <- sample(rep(c("e","f","g","h"),250),100)
> x <- rnorm(100,100)
> y <- rnorm(100,150)
> 
> mdl <- lm(y~x+f1+f2)
> 
> f12 <-rep(c("a","b","c","d"),5)
> f22 <- sample(rep(c("e","f","g","h"),250),20)
> x2 <- rnorm(20,100)
> 
> new <- data.frame(cbind(f12[1],f22[1],x2[1]))
> 
> 
> predict(mdl,new)
> 
> 
> best,
> 
> Spencer

Spencer,

You have two distinct issues going on here:

First, the model 'mdl' is fitted with 'f1' and 'f2' as character
vectors, not as factors. While the modeling functions will do the
coercion internally, I do not believe that the predict functions will.

In fact, you should have noted the following warning messages:

> mdl <- lm(y~x+f1+f2)
Warning messages:
1: variable 'f1' converted to a factor in: model.matrix.default(mt, mf,
contrasts) 
2: variable 'f2' converted to a factor in: model.matrix.default(mt, mf,
contrasts) 

So you end up with a 'class' conflict between the model frame object and
the new data object, since the latter will default to coercing 'f12' and
'f22' to factors.
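
A quick way to spot this kind of mismatch is to compare the classes of
the columns involved. The following is only a sketch: 'chr' and
'DF_demo' are made-up names, and I pass stringsAsFactors explicitly
because the default of data.frame() depends on the R version.

```r
# Illustrative only: 'chr' and 'DF_demo' are hypothetical names
chr <- rep(c("a", "b", "c", "d"), 25)
class(chr)                # "character"

# With stringsAsFactors = TRUE, data.frame() converts character
# vectors to factors when the data frame is built
DF_demo <- data.frame(f1 = chr, stringsAsFactors = TRUE)
class(DF_demo$f1)         # "factor"
levels(DF_demo$f1)        # "a" "b" "c" "d"
```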

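Secondly, 'new' needs to have columns created with the SAME names as
those used in the original model. Otherwise predict() does not find
'x', 'f1' and 'f2' in 'new' and, as far as I can tell, picks up the
original vectors from your workspace instead, which is why you get the
fitted values back.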
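
To see the naming problem, it helps to look at what the cbind()
construction actually produces. A small sketch with made-up values
('bad', 'good' and 'missing_vars' are just illustrative names):

```r
# Same construction as in the original post, with fixed example values
bad <- data.frame(cbind("a", "e", 100.2))
names(bad)                # "X1" "X2" "X3"

# Which of the model's predictors are absent from 'bad'?
missing_vars <- setdiff(c("x", "f1", "f2"), names(bad))
missing_vars              # "x" "f1" "f2"

# Naming the columns explicitly gives predict() something to match on
good <- data.frame(f1 = "a", f2 = "e", x = 100.2)
names(good)               # "f1" "f2" "x"
```

Note also that cbind() on mixed character and numeric values yields a
character matrix, so the numeric value is no longer numeric in 'bad'.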

Thus, a code sequence along the lines of the following should work:

f1 <- rep(c("a","b","c","d"), 25)
f2 <- sample(rep(c("e","f","g","h"), 250), 100)
x <- rnorm(100, 100)
y <- rnorm(100, 150)

# Create a data frame from the data so
# that f1 and f2 become factors
# (stringsAsFactors = TRUE is no longer the default as of R 4.0.0)
DF <- data.frame(y, x, f1, f2, stringsAsFactors = TRUE)

mdl <- lm(y ~ x + f1 + f2, DF)

f12 <- rep(c("a","b","c","d"), 5)
f22 <- sample(rep(c("e","f","g","h"), 250), 20)
x2 <- rnorm(20, 100)

# Create 'new' in the same way, but naming the
# columns the same as in 'DF' above
new <- data.frame(f1 = f12, f2 = f22, x = x2, stringsAsFactors = TRUE)

# Now run predict on the first row in 'new'
> predict(mdl, new[1, ])
[1] 150.3273

The number you come up with will differ, since you are using random
data and no seed is set.
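
If you want the same numbers on every run, set a seed first. The
following is a self-contained sketch along the lines of the code above;
the seed value 42 is arbitrary and 'p' is just an illustrative name. It
also predicts for all 20 rows of 'new' at once rather than only the
first.

```r
set.seed(42)  # arbitrary seed, chosen only for illustration

# Fitting data, with f1 and f2 stored as factors
DF <- data.frame(y  = rnorm(100, 150),
                 x  = rnorm(100, 100),
                 f1 = rep(c("a", "b", "c", "d"), 25),
                 f2 = sample(rep(c("e", "f", "g", "h"), 250), 100),
                 stringsAsFactors = TRUE)

mdl <- lm(y ~ x + f1 + f2, data = DF)

# New data, using the same column names as in 'DF'
new <- data.frame(f1 = rep(c("a", "b", "c", "d"), 5),
                  f2 = sample(rep(c("e", "f", "g", "h"), 250), 20),
                  x  = rnorm(20, 100),
                  stringsAsFactors = TRUE)

p <- predict(mdl, new)    # one point forecast per row of 'new'
length(p)                 # 20
```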

HTH,

Marc Schwartz