[BioC] expanding factors for lm

Mon Apr 18 17:55:24 CEST 2005

Hello,

I have a rather general question about factors that I'd like to use in linear models (for siRNA design):

I have two nucleotide positions for a gene (say NC1 and NC2), and 'A', 'T', 'C', 'G' are the 4 possible values for each nucleotide. There's a normally distributed response that I measure for some genes, and I'd like to know which nucleotide type (A, T, C or G) is significant at each position (NC1 and NC2, possibly with interactions).

I see two possibilities for coding these factors:

Two factors NC1 and NC2 each with levels A T C and G

or 8 factors with levels 0 or 1 (i.e. boolean):

A1 T1 C1 G1 A2 T2 C2 and G2
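
To make the two codings concrete, here is a minimal sketch in Python (illustration only: the gene list is made up, and the function names are hypothetical, not from any package):

```python
# Illustrative sketch: the same observation written in both codings.
LEVELS = ["A", "T", "C", "G"]

def two_factor_coding(nc1, nc2):
    """Coding 1: two factors, NC1 and NC2, each with four levels."""
    return {"NC1": nc1, "NC2": nc2}

def indicator_coding(nc1, nc2):
    """Coding 2: eight 0/1 variables A1..G1 and A2..G2."""
    row = {}
    for pos, base in ((1, nc1), (2, nc2)):
        for level in LEVELS:
            row[f"{level}{pos}"] = 1 if base == level else 0
    return row

# Hypothetical (NC1, NC2) pairs, one per gene.
genes = [("A", "G"), ("T", "T"), ("C", "A")]
for nc1, nc2 in genes:
    print(two_factor_coding(nc1, nc2), indicator_coding(nc1, nc2))
```

Note that in the second coding exactly one of A1..G1 and exactly one of A2..G2 equals 1 for every gene.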

Which solution would be most appropriate?

I think about two main differences between the two possibilities:

1. Interpretation of the results. The model with 2 factors of 4 levels each might be more difficult to interpret, since e.g. a treatment contrast gives the effect of each of the 3 other nucleotide types (T, C or G) relative to the first nucleotide ('A'), whereas the model with 8 factors directly tells you whether having a given nucleotide at a given position (coded by 1) is significant or not. The ANOVA table would also be easier to interpret for the 8-factor model.

2. Different degrees of freedom (more factors, each with fewer DFs). When there are many nucleotide positions, many factors need to be estimated from relatively few measurements (of the dependent variable).
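
One way to check the DF point is to count how many parameters each coding can actually estimate. The sketch below is only an illustration under stated assumptions: main effects only, an intercept in the model, 'A' as the reference level for the treatment coding, and all 16 nucleotide combinations observed. The helper names are hypothetical, and the rank is computed by plain Gaussian elimination rather than by any model-fitting routine.

```python
from fractions import Fraction
from itertools import product

LEVELS = ["A", "T", "C", "G"]

def rank(matrix):
    """Matrix rank via Gauss-Jordan elimination with exact fractions."""
    rows = [[Fraction(x) for x in r] for r in matrix]
    rk = 0
    for col in range(len(rows[0])):
        # Find a row at or below position rk with a nonzero entry in this column.
        pivot = next((i for i in range(rk, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[rk], rows[pivot] = rows[pivot], rows[rk]
        # Eliminate this column from every other row.
        for i in range(len(rows)):
            if i != rk and rows[i][col] != 0:
                f = rows[i][col] / rows[rk][col]
                rows[i] = [a - f * b for a, b in zip(rows[i], rows[rk])]
        rk += 1
    return rk

def treatment_row(nc1, nc2):
    # Intercept plus 3 dummies per position ('A' is the assumed reference).
    return ([1] + [int(nc1 == l) for l in LEVELS[1:]]
                + [int(nc2 == l) for l in LEVELS[1:]])

def indicator_row(nc1, nc2):
    # Intercept plus all 8 indicators A1..G1, A2..G2.
    return ([1] + [int(nc1 == l) for l in LEVELS]
                + [int(nc2 == l) for l in LEVELS])

combos = list(product(LEVELS, repeat=2))   # all 16 (NC1, NC2) pairs
X_treat = [treatment_row(a, b) for a, b in combos]
X_ind = [indicator_row(a, b) for a, b in combos]

rank_treat = rank(X_treat)
rank_ind = rank(X_ind)
print("treatment coding:", len(X_treat[0]), "columns, rank", rank_treat)
print("indicator coding:", len(X_ind[0]), "columns, rank", rank_ind)
```

Under these assumptions both design matrices have the same rank, because A1+T1+C1+G1 and A2+T2+C2+G2 each add up to the intercept column. So the 8-indicator coding has more columns but not more estimable parameters, and some of its coefficients cannot be estimated separately unless the intercept or some indicators are dropped.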

Any comments and discussion are appreciated,

	kind regards,

	Arne