[R] Questions on factors in regression analysis

Thu Aug 20 21:53:19 CEST 2009

On Aug 20, 2009, at 3:42 PM, guox at ucalgary.ca wrote:

> Thanks!
>>
>> On Aug 20, 2009, at 1:46 PM, guox at ucalgary.ca wrote:
>>
>>> I got two questions on factors in regression:
>>>
>>> Q1.
>>> In a table, there are a few categorical/factor variables, a few
>>> numerical
>>> variables and the response variable is numeric. Some factors are
>>> important
>>> but others not.
>>> How to determine which categorical variables are significant to the
>>> response variable?
>>
>> Seems that you should engage the services of a consulting  
>> statistician
>> for that sort of question. Or post in a venue where statistical
>> consulting is supposed to occur, such as one of the sci.stat.*
>> newsgroups.
>
> I googled sci.stat.* and got sci.stat.math and sci.stat.consult.
> Are they good?

The quality of responses varies. You may get what you pay for. On the  
other hand sometimes you get high-quality advice for free.

> I have no idea how to do this. So any clue will be appreciated.

http://groups.google.com/?hl=en

>
>>
>>>
>>> Q2.
>>> As we know, lm can deal with categorical variables.
>>> I thought that, when there is a categorical predictor, we may use lm
>>> directly without quantifying the factor, and that assigning different
>>> values to the factor levels would not change the fit, as shown:
>>
>> The "numbers" that you are attempting to assign are really just  
>> labels
>> for the factor levels. The regression functions in R will not use  
>> them
>> for any calculations. They should not be thought of as having
>> "values". Even if the factor is an ordered factor, the labels may not
>> be interpretable as having the same numerical order as the string
>> values might suggest.
>>
>>>
>>> x <- 1:20 ## numeric predictor
>>> yes.no <- c("yes","no")
>>> factors <- gl(2,10,20,yes.no) ##factor predictor
>>> factors.quant <-  rep(c(18.8,29.9),c(10,10)) ##quantification of factors
>>
>> Not sure what that is supposed to mean. It is not a factor object, even
>> though you may be misleading yourself into believing it should be.
>> It's a numeric vector.
>
> Yes, levels are not numeric but just labels. But after the factor
> levels were assigned numeric values as factors.quant and
> factors.quant.1, lm(response ~ x + factors.quant) and
> lm(response ~ x + factors.quant.1) produced the same fitted curve as
> lm(response ~ x + factors). This is what I could not understand.

In both the factor variable case and the numeric variable case there
was no variation in the predictor variable within a level: each
"quantified" vector takes only two distinct values, one per level. So
the contribution of that predictor to the predictions will be the same
within levels in each case, and the fitted values agree. There will be
differences in the coefficients arrived at to achieve that result,
however.
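
A minimal sketch of that point, assuming simulated data shaped like the example in the original post (the object names f, q, fit.f and fit.q and the seed are illustrative, not from the thread). A two-level factor and a two-valued numeric vector give the same fitted values, and the factor coefficient is just the numeric slope rescaled by the gap between the two assigned values:

```r
set.seed(1)                                  # arbitrary seed, only for reproducibility
x <- 1:20                                    # numeric predictor
f <- gl(2, 10, 20, labels = c("yes", "no"))  # two-level factor
q <- rep(c(18.8, 29.9), c(10, 10))           # two-valued numeric "quantification"
y <- 0.8 * x + 18 + q + rnorm(20)            # simulated response

fit.f <- lm(y ~ x + f)                       # factor enters as a 0/1 dummy column
fit.q <- lm(y ~ x + q)                       # q enters as an ordinary numeric column

# Fitted values agree up to floating-point error
all.equal(unname(fitted(fit.f)), unname(fitted(fit.q)))

# The coefficients differ: the dummy coefficient equals the slope on q
# multiplied by the difference between the two values (29.9 - 18.8 = 11.1)
unname(coef(fit.f)["fno"] / coef(fit.q)["q"])
```

The same ratio can be checked against the output quoted below: 13.7090 / 1.2350 is about 11.1, and 13.7090 / 0.6231 is about 22.0 (that is, 38.9 - 16.9). This equivalence is specific to two levels; it should not be expected to hold for a factor with three or more levels.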

>
>>> str(factors.quant)
>>  num [1:20] 18.8 18.8 18.8 18.8 18.8 18.8 18.8 18.8 18.8 18.8 ...
>>
>>> factors.quant.1 <-  rep(c(16.9,38.9),c(10,10)) ##second quantification of factors
>>> response <- 0.8*x + 18 + factors.quant + rnorm(20) ##response
>>> lm.quant <- lm(response ~ x + factors.quant) ##lm with quantifications
>>> lm.fact <- lm(response ~ x + factors) ##lm with factors
>>
>>> lm.quant
>>
>> Call:
>> lm(formula = response ~ x + factors.quant)
>>
>> Coefficients:
>>   (Intercept)              x  factors.quant
>>       14.9098         0.5385         1.2350
>>
>>> lm.fact
>>
>> Call:
>> lm(formula = response ~ x + factors)
>>
>> Coefficients:
>> (Intercept)            x    factorsno
>>     38.1286       0.5385      13.7090
>>>
>>> lm.quant.1 <- lm(response ~ x + factors.quant.1) ##lm with quantifications
>>
>>> lm.quant.1
>>
>> Call:
>> lm(formula = response ~ x + factors.quant.1)
>>
>> Coefficients:
>>     (Intercept)                x  factors.quant.1
>>         27.5976           0.5385           0.6231
>>
>>> lm.fact.1 <- lm(response ~ x + factors) ##lm with factors
>>>
>>> par(mfrow=c(2,2)) ## comparisons of two fittings
>>> plot(x, response)
>>> lines(x,fitted(lm.quant),col="blue")
>>> grid()
>>> plot(x,response)
>>> lines(x,fitted(lm.fact),col = "red")
>>> grid()
>>> plot(x, response)
>>> lines(x,fitted(lm.quant.1),lty =2,col="blue")
>>> grid()
>>> plot(x,response)
>>> lines(x,fitted(lm.fact.1),lty =2,col = "red")
>>> grid()
>>> par(mfrow = c(1,1))
>>>
>>> So, is it right that we can assign any numeric values to factors,
>>> for example, c(yes, no) = c(18.8,29.9) or (16.9,38.9) in the above,
>>> before doing lm, glm, aov, even nls?
>>
>> You can give factor levels any name you like, including any sequence
>> of digit characters. Unlike "ordinary" R, where unquoted numbers cannot
>> start variable names, factor functions will coerce numeric vectors to
>> character vectors when assigning level names. But you seem to be
>> conflating factors with numeric vectors that have many ties. Those two
>> entities would have different handling by R's regression functions.
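
A short sketch of the distinction drawn above, using small made-up vectors (f and g are illustrative names, not objects from the thread). It shows that numeric-looking level names are stored as character labels, and that a factor and a numeric vector produce different design matrices once there are more than two levels:

```r
# Numeric input is coerced to character labels when the factor is built
f <- factor(c(18.8, 29.9, 18.8, 29.9))
levels(f)                    # character labels "18.8" and "29.9"
is.numeric(f)                # FALSE: a factor is not a numeric vector
as.numeric(f)                # internal integer codes 1 2 1 2, not the labels
as.numeric(as.character(f))  # recovers 18.8 29.9 18.8 29.9 explicitly

# With three levels the two kinds of object are coded differently
g <- gl(3, 2)                        # factor with levels "1", "2", "3"
ncol(model.matrix(~ g))              # intercept plus two dummy columns
ncol(model.matrix(~ as.numeric(g)))  # intercept plus a single slope column
```

With the factor, each level beyond the first gets its own coefficient, so the labels never enter the arithmetic. With the numeric vector, a single slope is estimated and the actual values matter.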

>> -- 

David Winsemius, MD
Heritage Laboratories
West Hartford, CT