[R] Error in lm() with very small (close to zero) regressor

John Maindonald john.maindonald at anu.edu.au
Mon Mar 30 02:42:23 CEST 2015


There are two issues:

1) Accuracy is damaged in the calculation of t(X) %*% X. Where the
mean to SD ratio of a column is large, maybe even of the order of
100 or so, centre that column.

2) pseudoinverse() goes awry when columns of X are of the order of
large negative (or, I expect, positive) powers of ten. I'd guess that
this has to do with the way that a check for singularity is
implemented.
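A minimal sketch of the cancellation behind issue 1 (my own construction, not part of Ben's example below): with a mean-to-SD ratio r, the variance carries only about 1/r^2 of sum(x^2), so forming t(X) %*% X can destroy it.

```r
## With mean/SD ratio 1e8 the variance contributes only ~1e-16 of
## sum(x^2), at the edge of double precision, so the "textbook"
## computation sum(x^2) - n*mean(x)^2 suffers catastrophic
## cancellation, while the centred computation does not.
set.seed(1)
x <- rnorm(1000, mean = 1e8, sd = 1)
n <- length(x)
v_direct  <- (sum(x^2) - n * mean(x)^2) / (n - 1)  # cancellation-prone
v_centred <- sum((x - mean(x))^2) / (n - 1)        # stable, close to 1
c(v_direct, v_centred)
```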

Ben’s example:

set.seed(101)
n_obs <- 1000
y  <- rnorm(n_obs, 10,2.89)
x1 <- rnorm(n_obs, mean=1.235657e-14,sd=4.5e-17)
x2 <- rnorm(n_obs, 10,3.21)
X  <- cbind(rep(1,n_obs),x1,x2)

coef(lm(y~x1+x2))
##   (Intercept)            x1            x2
##  1.155959e+01 -1.658420e+14  3.797342e-02

library(corpcor)
t(cbind(coef(lm(y~x1+x2)),
           as.vector(pseudoinverse(t(X) %*% X) %*% t(X) %*% y)))
##      (Intercept)            x1         x2
## [1,]   11.559587 -1.658420e+14 0.03797342
## [2,]     9.511348  1.151908e-13  0.03788714

## Examine mean to SD ratios
round(c(mean(x1)/sd(x1), mean(x2)/sd(x2)), 2)
## [1] 263.65   3.28
## Notice that the coefficients for x2, where the mean/sd ratio
## is smaller, roughly agree.

## Damage was done to accuracy in the calculation of t(X) %*% X
## (but there is more to it than that, as we will see).

## Try
xc1 <- scale(x1,center=TRUE, scale=FALSE)
xc2 <- scale(x2,center=TRUE, scale=FALSE)
Xc <- cbind(rep(1,n_obs), xc1, xc2)
as.vector(pseudoinverse(t(Xc) %*% Xc) %*% t(Xc) %*% y)
## With the centred columns the xc1 coefficient again comes out
## essentially zero; the intercept is mean(y) and the xc2 slope is
## that of a simple regression of y on x2.

## Note, however, that with centred columns one should be able to
## dispense with the column of 1s, with no change to the coefficients
## of xc1 & xc2:
Xc0 <- cbind(xc1,xc2)
as.vector(pseudoinverse(t(Xc0) %*% Xc0) %*% t(Xc0) %*% y)
## [1] 1.971167e-20 3.788714e-02
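The singularity-check guess can be probed directly. This sketch assumes corpcor's documented default tolerance, roughly max(dim(m)) * max(d) * .Machine$double.eps, below which singular values are treated as zero:

```r
## Probe the singularity check: the singular value carrying x1's
## information sits far below the default cutoff, so pseudoinverse()
## silently discards that direction and returns the minimum-norm
## solution, i.e. an essentially zero coefficient for x1.
M   <- t(X) %*% X
d   <- svd(M)$d
tol <- max(dim(M)) * max(d) * .Machine$double.eps
d        # note the enormous spread of singular values
d < tol  # the direction associated with x1 falls below the cutoff
```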

##
## Now try a more sensible scaling for xc1
Xcs <- cbind(rep(1,n_obs), xc1*1e14, xc2)
Xcs0 <- cbind(xc1*1e14, xc2)

t(cbind(coef(lm(y~I(1e14*xc1)+xc2)),
            as.vector(pseudoinverse(t(Xcs) %*% Xcs) %*% t(Xcs) %*% y)))
##      (Intercept) I(1e+14 * xc1)        xc2
## [1,]    9.899249       -1.65842 0.03797342
## [2,]    9.899249       -1.65842 0.03797342

## Eureka!

## cf also
as.vector(pseudoinverse(t(Xcs0) %*% Xcs0) %*% t(Xcs0) %*% y)
## [1] -1.65842037  0.03797342
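For comparison (my addition, not in the original post): lm() never forms t(X) %*% X at all; it works from a QR decomposition of X itself, which avoids squaring the condition number. The same route can be called directly and should reproduce lm()'s coefficients on the original X:

```r
## QR-based least squares on the original (uncentred, unscaled) X;
## this is essentially what lm() does internally, and it sidesteps
## the loss of accuracy incurred by forming t(X) %*% X.
qr.coef(qr(X), y)
```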


John Maindonald             email: john.maindonald at anu.edu.au

Centre for Mathematics & Its Applications,

Australian National University, Canberra ACT 0200.


On 29/03/2015, at 23:00, r-help-request at r-project.org wrote:

From: Ben Bolker <bbolker at gmail.com>
Subject: Re: [R] Error in lm() with very small (close to zero) regressor
Date: 29 March 2015 6:28:22 NZDT
To: r-help at stat.math.ethz.ch


peter dalgaard <pdalgd at gmail.com> writes:



On 28 Mar 2015, at 00:32 , RiGui <raluca.gui at business.uzh.ch> wrote:

Hello everybody,

I have encountered the following problem with lm():

When running lm() with a regressor close to zero (of the order of
1e-10), the estimate has a huge absolute value, of the order of
millions.

However, if I write the formula of the OLS estimator in matrix
notation, pseudoinverse(t(X) %*% X) %*% t(X) %*% y, the results are
correct, meaning the estimate has value 0.

 How do you know this answer is "correct"?

here is the code:

y  <- rnorm(n_obs, 10,2.89)
x1 <- rnorm(n_obs, 0.00000000000001235657,0.000000000000000045)
x2 <- rnorm(n_obs, 10,3.21)
X  <- cbind(x1,x2)

Eh?  You want
X  <- cbind(rep(1,n_obs),x1,x2)

bFE <- lm(y ~ x1 + x2)
bFE

bOLS <- pseudoinverse(t(X) %*% X) %*% t(X) %*% y
bOLS


Note: I am applying a deviation-from-the-mean projection to the
data; that is why I have some regressors with such small values.

Thank you for any help!

Raluca

 Is there a reason you can't scale your regressors?


Example not reproducible:


 I agree that the OP's question was not reproducible, but it's
not too hard to make it reproducible. I used
library("sos"); findFn("pseudoinverse") to find pseudoinverse()
in the corpcor package.

It is true that we get estimates with very large magnitudes, but
their standard errors are correspondingly large, so (as the
summaries below show) the t statistics are unaffected:
set.seed(101)
n_obs <- 1000
y  <- rnorm(n_obs, 10,2.89)
x1 <- rnorm(n_obs, mean=1.235657e-14,sd=4.5e-17)
x2 <- rnorm(n_obs, 10,3.21)
X  <- cbind(x1,x2)

bFE <- lm(y ~ x1 + x2)
bFE
coef(summary(bFE))

                Estimate   Std. Error     t value  Pr(>|t|)
(Intercept)  1.155959e+01 2.312956e+01  0.49977541 0.6173435
x1          -1.658420e+14 1.872598e+15 -0.08856254 0.9294474
x2           3.797342e-02 2.813000e-02  1.34992593 0.1773461

library("corpcor")
bOLS <- pseudoinverse(t(X) %*% X) %*% t(X) %*% y
bOLS

            [,1]
[1,] 9.807664e-16
[2,] 8.880273e-01

And if we scale the predictor:

bFE2 <- lm(y ~ I(1e14*x1) + x2)
coef(summary(bFE2))

                Estimate Std. Error     t value  Pr(>|t|)
(Intercept)   11.55958731   23.12956  0.49977541 0.6173435
I(1e+14 * x1) -1.65842037   18.72598 -0.08856254 0.9294474
x2             0.03797342    0.02813  1.34992593 0.1773461

bOLS stays constant.

To be honest, I haven't thought about this enough to see
which answer is actually correct, although I suspect the
problem is in bOLS, since the numerical methods behind lm
(unlike the brute-force pseudoinverse approach used here)
have been carefully designed for numerical stability.
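One way to adjudicate without trusting either routine (a check of my own, using the intercept-corrected X that Peter pointed out): the least-squares solution minimises the residual sum of squares, so whichever set of coefficients achieves the smaller RSS is closer to the truth. lm's fit should come out no worse here:

```r
## Compare residual sums of squares of the two candidate solutions.
## The pseudoinverse answer discards the x1 direction, so its RSS
## can only be >= that of the full least-squares fit from lm().
X1    <- cbind(1, x1, x2)   # X including the intercept column
bOLS1 <- pseudoinverse(t(X1) %*% X1) %*% t(X1) %*% y
c(lm   = sum(resid(lm(y ~ x1 + x2))^2),
  pinv = sum((y - X1 %*% bOLS1)^2))
```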






