[R] Factors attribute format
Marc Schwartz
marc_schwartz at me.com
Mon Mar 22 22:15:26 CET 2010
On Mar 22, 2010, at 2:00 PM, rkevinburton at charter.net wrote:
> Thanks to Marc Schultz I found the documentation on the "factors" attribute under ?terms.object. It states:
<cough> ;-)
> factors: A matrix of variables by terms showing which variables appear
> in which terms. The entries are 0 if the variable does not
> occur in the term, 1 if it does occur and should be coded by
> contrasts, and 2 if it occurs and should be coded via dummy
> variables for all levels (as when an intercept or lower-order
> term is missing). If there are no terms other than an
> intercept and offsets, this is ‘numeric(0)’.
The key is 'dummy variables for *all* levels'. In other words, a factor such as the 12 months in your example below would be represented by 12 individual binary (0/1) columns rather than, for example under the default treatment contrasts, 11 binary (0/1) columns, where the base or reference level is left out of the resulting model matrix.
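A rough sketch of the two codings, assuming R's default treatment contrasts. The 12-level 'month' factor here is made up for illustration (it is not your TSA data), and dropping the intercept is used only as a convenient way to get the full set of dummy columns from model.matrix():

```r
# Assumes the default contrasts (contr.treatment for unordered factors).
# 'month' is a hypothetical 12-level factor built from the month.abb constant.
month <- factor(month.abb, levels = month.abb)
d <- data.frame(month = month)

mm_contrasts <- model.matrix(~ month, d)      # intercept present: coded by contrasts
mm_dummies   <- model.matrix(~ month - 1, d)  # intercept removed: one column per level

colnames(mm_contrasts)  # intercept plus 11 month columns, Jan is the reference level
colnames(mm_dummies)    # 12 month columns, Jan included
```

Counting the columns whose names start with "month" in each matrix shows the 11 versus 12 difference directly.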
I have not spent a lot of time on this internal R/S model design point, but as a simple example, a '2' will appear when an interaction term is present without the main effects term for the second factor:
> attr(terms(y ~ x + z), "factors")
  x z
y 0 0
x 1 0
z 0 1
> attr(terms(y ~ x + x:z), "factors")
  x x:z
y 0   0
x 1   2
z 0   1
Compare the second example above with the more common:
> attr(terms(y ~ x * z), "factors")
  x z x:z
y 0 0   0
x 1 0   1
z 0 1   1
which is of course equivalent to:
> attr(terms(y ~ x + z + x:z), "factors")
  x z x:z
y 0 0   0
x 1 0   1
z 0 1   1
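The comparisons above can also be checked programmatically. This is only a sketch; the object names are mine, and terms() is applied to bare formulas, so no data are needed:

```r
# Extract the "factors" attribute for the three formulas shown above.
f_nested <- attr(terms(y ~ x + x:z), "factors")
f_star   <- attr(terms(y ~ x * z), "factors")
f_full   <- attr(terms(y ~ x + z + x:z), "factors")

# x * z expands to x + z + x:z, so the two matrices should match.
identical(f_star, f_full)

# Locate any entries coded 2 (dummy variables for all levels).
which(f_nested == 2, arr.ind = TRUE)
which(f_star == 2, arr.ind = TRUE)
```

Only the formula lacking the main effect for z should report a 2, in row x of the x:z column.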
The difference in the encodings will be reflected in the model matrix. See ?model.matrix and play around with the examples there, including adding interaction terms. For example, model.matrix( ~ a + a:b, dd), etc.
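A minimal sketch along those lines. The data frame dd is written out here as I recall it from the ?model.matrix examples, so treat its exact definition as an assumption; the other object names are just for illustration:

```r
# dd as (I believe) defined in the ?model.matrix examples: two crossed factors.
dd <- data.frame(a = gl(3, 4), b = gl(4, 1, 12))

# No response in the formula, so the factors matrix has rows a and b only.
f_nested <- attr(terms(~ a + a:b), "factors")
f_nested

mm_nested <- model.matrix(~ a + a:b, dd)  # no main effect for b
mm_full   <- model.matrix(~ a * b, dd)    # both main effects plus interaction

colnames(mm_nested)  # the a:b columns cover every level of a
colnames(mm_full)    # the a:b columns omit the reference level of a
```

Both matrices have the same number of columns, but in the first one the interaction columns carry all three levels of a (the '2' in the factors matrix), while b is still coded by contrasts.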
This discussion leads into the complex issue of the internal representation of R (and S) models. If you really want to dig deeper, you should get a copy of "Statistical Models in S" by Chambers and Hastie, 1992 (aka "The White Book"), and specifically note the rule described at the bottom of page 38 therein, perhaps after reading the entire chapter leading up to that point.
HTH,
Marc
> So now this brings up another question. It seems that the attribute is a two-dimensional array. When I print it out in 'R'
>
> Fitting the formula prestige ~ income + education I get:
>
>           income education
> prestige       0         0
> income         1         0
> education      0         1
>
> This matrix says to me that 'income' occurs in the term 'income' etc. So it seems that this matrix will always be a diagonal matrix with an added row of zeros containing the response term. If the formula is such that the response is a function of one or more of the dependent variables then of course it will be something other than a row of zeros. So far OK?
>
> My problem in understanding comes with using a formula that contains R factors. I am using the following (from the TSA package) for an example:
>
> l <- lm(tempdub ~ season(tempdub))
> attr(l$terms, "factors")
>
>                 season(tempdub)
> tempdub                       0
> season(tempdub)               1
>
> The function 'season' produces a factor (in this case with 12 levels, one for each month). But the factor attribute still has a '1' and not a '2' indicating that the variable should be coded as a dummy variable (factor).
>
> Please help my misunderstanding.
>
> Thank you.
>
> Kevin Burton