[R] How to properly build model matrices

Uwe Ligges ligges at statistik.tu-dortmund.de
Sat Feb 11 19:25:12 CET 2012

On 09.02.2012 22:39, Yang Zhang wrote:
> I always bump into a few (very minor) problems when building model
> matrices with e.g.:
> train = model.matrix(label~., read.csv('train.csv'))
> target = model.matrix(label~., read.csv('target.csv'))
> (1) The two may have different factor levels, yielding different
> matrices.  I usually first rbind the data frames together to "meld"
> the factors, and then split them apart and matrixify them.

You can preprocess the data and explicitly define the same levels for
each factor variable in both data.frames, e.g. with
factor(x, levels = ...), so that both model matrices get the same
columns.
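
A minimal sketch of that preprocessing, assuming a single factor column.
The data frames and the column name "colour" are made up for
illustration and stand in for the contents of train.csv and target.csv:

```r
# Hypothetical toy data standing in for train.csv and target.csv
train_df  <- data.frame(label  = c(1, 0, 1),
                        colour = c("red", "green", "blue"),
                        stringsAsFactors = FALSE)
target_df <- data.frame(label  = c(0, 1),
                        colour = c("red", "green"),
                        stringsAsFactors = FALSE)

# Define one level set and apply it to both data frames
colour_levels    <- sort(unique(c(train_df$colour, target_df$colour)))
train_df$colour  <- factor(train_df$colour,  levels = colour_levels)
target_df$colour <- factor(target_df$colour, levels = colour_levels)

train  <- model.matrix(label ~ ., train_df)
target <- model.matrix(label ~ ., target_df)

# Both matrices now have the same columns, in the same order
identical(colnames(train), colnames(target))
```

model.matrix() keeps a column for every declared level, even one that
does not occur in the data, which is why the target matrix does not
lose columns here.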

> (2) The target set that I'm predicting on typically doesn't have
> labels.  I usually manually append dummy labels to the target data
> frame.

R cannot know labels if you do not provide any.
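
That said, the design matrix itself depends only on the right-hand side
of the formula, so one possible way to avoid dummy labels is a one-sided
formula. A small sketch, assuming the unlabeled data contain exactly the
predictor columns (the data frame and its column names are hypothetical):

```r
# Hypothetical unlabeled data: predictors only, no label column
unlabeled_df <- data.frame(x1 = c(0.5, 1.2),
                           x2 = factor(c("a", "b"), levels = c("a", "b")))

# One-sided formula: no response is needed to build the design matrix
unlabeled <- model.matrix(~ ., unlabeled_df)
colnames(unlabeled)
```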

> (3) I almost always remove the Intercept from the model matrices,
> since it seems to always be redundant (I usually use caret).

Then change your model formula to "label ~ . - 1". But note that this
changes the coding of the columns and hence the interpretation of the
coefficients, and the intercept is *not* redundant in general.
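
To illustrate the change, here is a small sketch with a hypothetical
data frame containing one three-level factor "f". Without the intercept,
the factor is coded with a column for every level instead of contrasts
against the first level:

```r
# Hypothetical data: a response and one three-level factor
df <- data.frame(label = c(1, 0, 1),
                 f     = factor(c("a", "b", "c")))

with_int    <- model.matrix(label ~ .,     df)
without_int <- model.matrix(label ~ . - 1, df)

colnames(with_int)     # "(Intercept)" "fb" "fc"
colnames(without_int)  # "fa" "fb" "fc"
```

With numeric predictors only, dropping the intercept instead forces the
fit through the origin, which is usually a different model.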

Uwe Ligges

> None of these is a big deal at all, but I'm just curious if I'm
> missing something simple in how I'm doing things.  Thanks.
