[R] Multiplication of regression coefficient by factor type variable
Bert Gunter
bgunter@4567 @ending from gm@il@com
Sat Nov 17 22:58:26 CET 2018
You shouldn't have to do any of what you are doing.
See ?predict.lm and note the "newdata" argument.
Also, you should spend some time studying a linear model text, as your
question appears to indicate some basic confusion (e.g. about
"contrasts" ) about how they work.
Bert Gunter
"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
On Sat, Nov 17, 2018 at 1:24 PM Julian Righ Sampedro
<julian.righ.sampedro using gmail.com> wrote:
>
> Dear all,
>
> In a context of regression, I have three regressors, two of which are
> categorical variables (sex and education) and have class 'factor'.
>
> y = data$income
> x1 = as.factor(data$sex) # 2 levels
> x2 = data$age # continuous
> x3 = as.factor(data$ed) # 8 levels
>
> for example, the first entries of x3 are
>
> head(x3)[1] 5 3 5 5 4 2
> Levels: 1 2 3 4 5 6 7 8
>
> When we call the model, the output looks like this
>
> model1=lm(y ~ x1 + x2 + x3, data = data)
> summary(model1)
>
> Residuals:
> Min 1Q Median 3Q Max -31220 -6300 -594 4429 190731
>
> Coefficients:
> Estimate Std. Error t value Pr(>|t|) (Intercept) 1440.66
> 3809.99 0.378 0.705417
> x1 -4960.88 772.96 -6.418 2.13e-10 ***
> x2 181.45 25.03 7.249 8.41e-13 ***
> x32 2174.95 3453.22 0.630 0.528948
> x33 7497.68 3428.94 2.187 0.029004 *
> x34 8278.97 3576.30 2.315 0.020817 *
> x35 13686.88 3454.93 3.962 7.97e-05 ***
> x36 15902.92 4408.49 3.607 0.000325 ***
> x37 28773.13 3696.77 7.783 1.76e-14 ***
> x38 31455.55 5448.11 5.774 1.03e-08 ***---
> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>
> Residual standard error: 12060 on 1001 degrees of freedom
> Multiple R-squared: 0.2486, Adjusted R-squared: 0.2418
> F-statistic: 36.79 on 9 and 1001 DF, p-value: < 2.2e-16
>
> Now suppose I want to compute the residuals. To do so I first need to
> compute the prediction by the model. (I use it in a cross validation
> context so it is a partial display of the code)
>
> yhat1 = model1$coef[1] + model1$coef[2]*x1[i] + model1$coef[3]*x2[i] +
> model1$coef[4]*x3[i]
>
> But I get the following warnings
>
> Warning messages:1: In Ops.factor(model1$coef[2], x1[i]) : ‘*’ not
> meaningful for factors2: In Ops.factor(model1$coef[4], x3[i]) : ‘*’
> not meaningful for factors
>
> 1st question: Is there a way to multiply the coefficient by the 'factor'
> without having to transform my 'factor' into a 'numeric' type variable ?
>
> 2nd question: Since x3 is associated with 7 parameters (one for x32, one
> for x33, ... , one for x38), how do I multiply the 'correct' parameter
> coefficient with my 'factor' x3 ?
>
> I have been considering a 'if then' solution, but to no avail. I also have
> considered splitting my x3 variable into 8 binary variables without
> succeeding. What may be the best approach ? Thank you for your help.
>
> Since I understand this my not be specific enough, I add here the complete
> code
>
> # for n-fold cross validation# fit models on leave-one-out samples
> x1= as.factor(data$sex)
> x2= data$age
> x3= as.factor(data$ed)
> yn=data$income
> n = length(yn)
> e1 = e2 = numeric(n)
>
> for (i in 1:n) {
> # the ith observation is excluded
> y = yn[-i]
> x_1 = x1[-i]
> x_2 = x2[-i]
> x_3 = x3[-i]
> x_4 = as.factor(cf4)[-i]
> # fit the first model without the ith observation
> J1 = lm(y ~ x_1 + x_2 + x_3)
> yhat1 = J1$coef[1] + J1$coef[2]*x1[i] + J1$coef[3]*x2[i] + J1$coef[4]*x3[i]
> # construct the ith part of the loss function for model 1
> e1[i] = yn[i] - yhat1
> # fit the second model without the ith observation
> J2 = lm(y ~ x_1 + x_2 + x_3 + x_4)
> yhat2=J2$coef[1]+J2$coef[2]*x1[i]+J2$coef[3]*x2[i]+J2$coef[4]*x3[i]+J2$coef[5]*cf4[i]
> e2[i] = yn[i] - yhat2
> }
> sqrt(c(mean(e1^2),mean(e2^2))) # RMSE
>
> cf4 is a variable corresponding to groups (using mixtures) . What we want
> to demonstrate is that the prediction error after cross-validation is lower
> when we include this latent grouping variable. It works wonders when the
> categorical variables are treated as 'numeric' variables. Though the ols
> estimates are obviously very different.
>
> Thank you in advance for your views on the problem.
>
> Best Regards,
>
> julian
>
> [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
More information about the R-help
mailing list