[R] Factors attribute format

Mon Mar 22 22:15:26 CET 2010

On Mar 22, 2010, at 2:00 PM, rkevinburton at charter.net wrote:

> Thanks to Marc Schultz I found the documentation on the "factors" attribute under ?terms.object. It states:

<cough>   ;-)

> factors: A matrix of variables by terms showing which variables appear
>          in which terms.  The entries are 0 if the variable does not
>          occur in the term, 1 if it does occur and should be coded by
>          contrasts, and 2 if it occurs and should be coded via dummy
>          variables for all levels (as when an intercept or lower-order
>          term is missing).  If there are no terms other than an
>          intercept and offsets, this is ‘numeric(0)’.

The key is 'dummy variables for *all* levels'. In other words, your 12-month example below would be represented by 12 individual binary (0/1) columns, rather than the 11 binary (0/1) columns you would get with, for example, the default treatment contrasts, where the base (reference) level is not included in the resulting model matrix.
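
As a rough illustration of the two codings (a sketch only: the factor 'month' below is a hypothetical stand-in built from the built-in month.abb, not the TSA season() factor from your example, and it assumes the default contrasts options are in effect), contrasts() can show both forms side by side:

```r
# Hypothetical 12-level factor standing in for the month factor in the question
month <- factor(month.abb, levels = month.abb)

# Default coding for an unordered factor: treatment contrasts,
# one column fewer than the number of levels (first level is the reference)
trt <- contrasts(month)
dim(trt)

# Full dummy coding: one 0/1 column for every level
full <- contrasts(month, contrasts = FALSE)
dim(full)
```

The first matrix has 11 columns and the second has 12, which is the difference the '1' and '2' entries are meant to signal.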

I have not spent a lot of time on this internal R/S model design point, but as a rather simple example, a '2' will appear when an interaction term is present without the main-effect term for the second factor:

> attr(terms(y ~ x + z), "factors")
  x z
y 0 0
x 1 0
z 0 1

> attr(terms(y ~ x + x:z), "factors")
  x x:z
y 0   0
x 1   2
z 0   1

Compare the second example above with the more common:

> attr(terms(y ~ x * z), "factors")
  x z x:z
y 0 0   0
x 1 0   1
z 0 1   1

which is of course equivalent to:

> attr(terms(y ~ x + z + x:z), "factors")
  x z x:z
y 0 0   0
x 1 0   1
z 0 1   1

The difference in the encodings will be reflected in the model matrix. See ?model.matrix and play around with the examples there, including adding interaction terms. For example, model.matrix( ~ a + a:b, dd), etc.
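
Here is a minimal sketch of that comparison. The data frame 'dd' is made up for illustration in the spirit of the ?model.matrix examples, and the object names are mine, so treat it as something to experiment with rather than a definitive demonstration:

```r
# Small balanced data set: a has 3 levels, b has 4 levels
dd <- data.frame(a = gl(3, 4), b = gl(4, 1, 12))

# Interaction without the main effect for b:
# 'a' is flagged with 2 in the a:b column
f_nested <- attr(terms(~ a + a:b), "factors")
f_nested

# Full crossing: both variables are flagged with 1 in the a:b column
f_crossed <- attr(terms(~ a * b), "factors")
f_crossed

# The corresponding model matrices; compare the column names
mm_nested  <- model.matrix(~ a + a:b, dd)
mm_crossed <- model.matrix(~ a * b, dd)
colnames(mm_nested)
colnames(mm_crossed)
```

In the first model matrix the interaction columns cover every level of 'a' (a1:b2, a2:b2, a3:b2 and so on), while 'b' is still coded by contrasts. In the second, both factors are coded by contrasts within the interaction.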

This discussion leads into the complex issue of the internal representation of R (and S) models. If you really want to dig deeper, then you should get a copy of "Statistical Models in S" by Chambers and Hastie 1993 (aka "The White Book") and specifically note the rule described on the bottom of page 38 therein, perhaps pre-reading the entire chapter leading up to that particular point.

HTH,

Marc

> So now this brings up another question. It seems that the attribute is a two-dimensional array. When I print it out in 'R' 
> 
> Fitting the formula prestige ~ income + education I get:
> 
>          income education
> prestige       0         0
> income         1         0
> education      0         1
> 
> This matrix says to me that 'income' occurs in the term 'income' etc. So it seems that this matrix will always be a diagonal matrix with an added row of zeros containing the response term. If the formula is such that the response is a function of one or more of the dependent variables then of course it will be something other than a row of zeros. So far OK?
> 
> My problem in understanding comes with using a formula that contains R factors. I am using the following (from the TSA package)  for an example:
> 
> l <- lm(tempdub ~ season(tempdub))
> attr(l$terms, "factors")
> 
>                season(tempdub)
> tempdub                       0
> season(tempdub)               1
> 
> The function 'season' produces a factor (in this case with 12 levels, one for each month). But the factor attribute still has a '1' and not a '2' indicating that the variable should be coded as a dummy variable (factor).
> 
> Please help my misunderstanding.
> 
> Thank you.
> 
> Kevin Burton