[R] Error in lm() with very small (close to zero) regressor
peter dalgaard
pdalgd at gmail.com
Sun Mar 29 12:42:23 CEST 2015
> On 28 Mar 2015, at 18:28 , Ben Bolker <bbolker at gmail.com> wrote:
>
> peter dalgaard <pdalgd <at> gmail.com> writes:
>
>>
>>
>>> On 28 Mar 2015, at 00:32 , RiGui <raluca.gui <at> business.uzh.ch> wrote:
>>>
>>> Hello everybody,
>>>
>>> I have encountered the following problem with lm():
>>>
>>> When running lm() with a regressor close to zero - of the order of 1e-10 - the
>>> value of the estimate is of huge absolute value, of the order of millions.
>>>
>>> However, if I write the formula of the OLS estimator in matrix notation,
>>> pseudoinverse(t(X)*X) * t(X) * y, the results are correct, meaning the
>>> estimate has the value 0.
>
> How do you know this answer is "correct"?
>
>>> here is the code:
>>>
>>> y <- rnorm(n_obs, 10,2.89)
>>> x1 <- rnorm(n_obs, 0.00000000000001235657,0.000000000000000045)
>>> x2 <- rnorm(n_obs, 10,3.21)
>>> X <- cbind(x1,x2)
>>>
>>> bFE <- lm(y ~ x1 + x2)
>>> bFE
>>>
>>> bOLS <- pseudoinverse(t(X) %*% X) %*% t(X) %*% y
>>> bOLS
>>>
>>>
>>> Note: I am applying a deviation-from-the-mean projection to the data; that
>>> is why I have some regressors with such small values.
>>>
>>> Thank you for any help!
>>>
>>> Raluca
>
> Is there a reason you can't scale your regressors?
>
>>>
>> Example not reproducible:
>>
>
> I agree that the OP's question was not reproducible, but it's
> not too hard to make it reproducible. I bothered to use
> library("sos"); findFn("pseudoinverse") to find pseudoinverse()
> in corpcor:
Well, it shouldn't be my work, nor yours... And I thought it particularly egregious to treat a function from an unspecified package as gospel.
>
> It is true that we get estimates with very large magnitudes, but their
> standard errors are correspondingly huge, so the t statistics and
> p-values remain perfectly reasonable:
>
> set.seed(101)
> n_obs <- 1000
> y <- rnorm(n_obs, 10,2.89)
> x1 <- rnorm(n_obs, mean=1.235657e-14,sd=4.5e-17)
> x2 <- rnorm(n_obs, 10,3.21)
> X <- cbind(x1,x2)
>
> bFE <- lm(y ~ x1 + x2)
> bFE
> coef(summary(bFE))
>
> Estimate Std. Error t value Pr(>|t|)
> (Intercept) 1.155959e+01 2.312956e+01 0.49977541 0.6173435
> x1 -1.658420e+14 1.872598e+15 -0.08856254 0.9294474
> x2 3.797342e-02 2.813000e-02 1.34992593 0.1773461
>
> library("corpcor")
> bOLS <- pseudoinverse(t(X) %*% X) %*% t(X) %*% y
> bOLS
>
> [,1]
> [1,] 9.807664e-16
> [2,] 8.880273e-01
>
> And if we scale the predictor:
>
> bFE2 <- lm(y ~ I(1e14*x1) + x2)
> coef(summary(bFE2))
>
> Estimate Std. Error t value Pr(>|t|)
> (Intercept) 11.55958731 23.12956 0.49977541 0.6173435
> I(1e+14 * x1) -1.65842037 18.72598 -0.08856254 0.9294474
> x2 0.03797342 0.02813 1.34992593 0.1773461
>
> bOLS stays constant.
>
> To be honest, I haven't thought about this enough to see
> which answer is actually correct, although I suspect the
> problem is in bOLS, since the numerical methods (unlike
> the brute-force pseudoinverse method given here) behind
> lm have been carefully considered for numerical stability.
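Indeed. As a rough sketch of the difference (an illustration, not a claim about
what either routine does line for line): lm() works from a QR decomposition of
the model matrix itself and never forms t(X) %*% X, whereas the normal-equations
route squares the condition number before any inversion is even attempted.

Xm <- cbind(1, x1, x2)     # model matrix with intercept, objects as above
qr.coef(qr(Xm), y)         # should essentially agree with coef(bFE)
kappa(Xm)                  # (estimated) condition number of the model matrix ...
kappa(crossprod(Xm))       # ... roughly squared once t(X) %*% X is formed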
As for pseudoinverse(), note in particular that it has a tol= argument which allows it to zap small singular values:
> pseudoinverse(crossprod(X))%*%crossprod(X,y)
[,1]
[1,] 9.807664e-16
[2,] 8.880273e-01
> pseudoinverse(crossprod(X),tol=1e-40)%*%crossprod(X,y)
[,1]
[1,] 1.286421e+15
[2,] -5.327384e-01
Also, notice that there is no intercept in the above, so it would be more reasonable to compare to:
> bFE <- lm(y ~ x1 + x2-1)
> summary(bFE)
Call:
lm(formula = y ~ x1 + x2 - 1)
Residuals:
Min 1Q Median 3Q Max
-9.1712 -1.9439 -0.0321 1.7637 9.4540
Coefficients:
Estimate Std. Error t value Pr(>|t|)
x1 7.700e+14 2.435e+13 31.62 <2e-16 ***
x2 3.766e-02 2.811e-02 1.34 0.181
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.771 on 998 degrees of freedom
Multiple R-squared: 0.9275, Adjusted R-squared: 0.9273
F-statistic: 6382 on 2 and 998 DF, p-value: < 2.2e-16
Or, maybe better, use cbind(1,X) in the pseudoinverse() method. It doesn't help, though.
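That is, something along these lines (just a sketch of the construction; as said, the estimates come out just as wild):

X1 <- cbind(1, X)                                  # add an explicit intercept column
pseudoinverse(crossprod(X1)) %*% crossprod(X1, y)  # still nowhere near the lm() fit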
What really surprises me is this:
> X <- cbind(x1,x2)
> crossprod(X)
x1 x2
x1 1.526698e-25 1.265085e-10
x2 1.265085e-10 1.145462e+05
> solve(crossprod(X))
Error in solve.default(crossprod(X)) :
system is computationally singular: reciprocal condition number = 1.13052e-31
> solve(crossprod(X), tol=1e-40)
x1 x2
x1 7.722232e+25 -8.528686e+10
x2 -8.528686e+10 1.029237e-04
> pseudoinverse(crossprod(X), tol=1e-40)
[,1] [,2]
[1,] 6.222152e+25 -6.217178e+10
[2,] -6.871948e+10 7.739466e-05
How does pseudoinverse() go so badly wrong? Apart from the scaling, this really isn't very ill-conditioned, but a lot of accuracy seems to be lost in its SVD step. Notice that once x1 is rescaled, solve() and pseudoinverse() agree:
> X <- cbind(1e14*x1,x2)
> solve(crossprod(X))
x2
0.0077222325 -0.0008528686
x2 -0.0008528686 0.0001029237
> pseudoinverse(crossprod(X))
[,1] [,2]
[1,] 0.0077222325 -0.0008528686
[2,] -0.0008528686 0.0001029237
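If one wants to poke at that SVD step directly, the textbook construction of the Moore-Penrose inverse from the SVD is easy to play with (a sketch; corpcor's internals may differ in detail):

X0 <- cbind(x1, x2)              # the original, unscaled regressors
s  <- svd(crossprod(X0))
s$d                              # singular values spanning roughly 30 orders of magnitude
s$v %*% diag(1/s$d) %*% t(s$u)   # pseudoinverse built directly from the SVD

Keeping the tiny singular value essentially corresponds to the tol=1e-40 call above, while zapping it reproduces the default behaviour.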
--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd.mes at cbs.dk Priv: PDalgd at gmail.com