[R] Simple lm/regression question
Achim Zeileis
Achim.Zeileis at uibk.ac.at
Mon Feb 6 10:57:23 CET 2012
On Mon, 6 Feb 2012, James Annan wrote:
> I am trying to use lm for a simple linear fit with weights. The results
> I get from IDL (which I am more familiar with) seem correct and
> intuitive, but the "lm" function in R gives outputs that seem strange to
> me.
>
> Unweighted case:
>
>> x<-1:4
>> y<-(1:4)^2
>> summary(lm(y~x))
>
> Call:
> lm(formula = y ~ x)
>
> Residuals:
> 1 2 3 4
> 1 -1 -1 1
>
> Coefficients:
> Estimate Std. Error t value Pr(>|t|)
> (Intercept) -5.0000 1.7321 -2.887 0.1020
> x 5.0000 0.6325 7.906 0.0156 *
>
> So far, so good - IDL does much the same:
>
> IDL> vec=linfit(x,y,sigma=sig)
> IDL> print,vec,sig
> -5.00000 5.00000
> 1.73205 0.632456
>
> Now, if the dependent variable has known (measurement) uncertainties (10,
> say) then it is appropriate to use weights defined as the inverse of the
> variances, right?
If you think that in y = x'b + e the error has
E(e^2) = sigma^2 * w
then you should use
weights = 1/w
in R because that is how the weights enter the objective function
sum 1/w (y - x'b)^2
>> summary(lm(y~x,weights=c(.01,.01,.01,.01)))
>
> Call:
> lm(formula = y ~ x, weights = c(0.01, 0.01, 0.01, 0.01))
>
> Residuals:
> 1 2 3 4
> 0.1 -0.1 -0.1 0.1
>
> Coefficients:
> Estimate Std. Error t value Pr(>|t|)
> (Intercept) -5.0000 1.7321 -2.887 0.1020
> x 5.0000 0.6325 7.906 0.0156 *
>
> But here the *residuals* have changed and are no longer the actual
> response minus fitted values. (They seem to be scaled by the sqrt of the
> weights).
The summary() shows under "Residuals" the contributions to the objective
function, i.e. sqrt(1/w) (y - x'b) in the notation above.
However, by using the residuals extractor function you can get the
unweighted residuals:
residuals(lm(y~x,weights=c(.01,.01,.01,.01)))
> The uncertainties on the parameter estimates, however, have
> *not* changed, which seems very odd to me.
lm() interprets the weights as precision weights, not as case weights.
Thus, the scaling in the variances is done by the number of (non-zero)
weights, not by the sum of weights.
> The behaviour of IDL is rather different and intuitive to me:
>
> IDL> vec=linfit(x,y,sigma=sig,measure_errors=[1,1,1,1])
> IDL> print,vec,sig
> -5.00000 5.00000
> 1.22474 0.447214
>
> IDL> vec=linfit(x,y,sigma=sig,measure_errors=[10,10,10,10])
> IDL> print,vec,sig
> -5.00000 5.00000
> 12.2474 4.47214
This appears to use sandwich standard errors. If you load the sandwich and
lmtest package you can do
coeftest(m, vcov = sandwich)
where 'm' is the fitted lm object.
Hope that helps,
Z
> Note that the uncertainties are 10* larger when the errors are 10* larger.
>
> My question is really how to replicate these results in R. I suspect there is
> some terminology or convention that I'm confused by. But in the lm help page
> documentation, even the straightforward definition of "residual" seems
> incompatible with what the code actually returns:
>
> residuals: the residuals, that is response minus fitted values.
>
> But, in the weighted case, this is not actually true...
>
> James
> --
> James D Annan jdannan at jamstec.go.jp Tel: +81-45-778-5618 (Fax 5707)
> Senior Scientist, Research Institute for Global Change, JAMSTEC
> Yokohama Institute for Earth Sciences, 3173-25 Showamachi,
> Kanazawa-ku, Yokohama City, Kanagawa, 236-0001 Japan
> http://www.jamstec.go.jp/frcgc/research/d5/jdannan/
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
More information about the R-help
mailing list