[R] bigglm() results different from glm()
Thomas Lumley
tlumley at u.washington.edu
Tue Mar 17 08:50:20 CET 2009
This is a surprisingly interesting problem that took a while to debug, because the computations all seemed correct.
Your model hasn't converged yet. You can get the right answer either by running longer:
> summary(m1big_longer)
Large data regression model: bigglm(y ~ ttment, data = dat, family = poisson(link = "log"),
chunksize = 100000, maxit = 20)
Sample size = 100000
Coef (95% CI) SE p
(Intercept) 2.304 2.301 2.307 0.001 0
ttment2 0.405 0.401 0.408 0.002 0
or supplying starting values:
> summary(m1big_started)
Large data regression model: bigglm(y ~ ttment, data = dat, family = poisson(link = "log"),
chunksize = 100000, start = c(2, 0))
Sample size = 100000
Coef (95% CI) SE p
(Intercept) 2.304 2.301 2.307 0.001 0
ttment2 0.405 0.401 0.408 0.002 0
The bug is that you weren't told about the lack of convergence. There is a flag in the object, but it is only set after the model is converged, so it is not there when convergence fails.
> m1big$converged
NULL
> m1big_longer$converged
[1] TRUE
> m1big_started$converged
[1] TRUE
For the next version I will make sure there is a clear warning when the model hasn't converged. The default maximum number of iterations is fairly small, by design --- if it isn't working, you want to find out and specify starting values rather than wait for dozens of potentially slow iterations. This strategy obviously breaks down when you don't notice that failure. :(
-thomas
On Mon, 16 Mar 2009, Francisco J. Zagmutt wrote:
> Dear all,
>
> I am using the bigglm package to fit a few GLM's to a large dataset (3 million
> rows, 6 columns). While trying to fit a Poisson GLM I noticed that the
> coefficient estimates were very different from what I obtained when estimating
> the model on a smaller dataset using glm(), I wrote a very basic toy example to
> compare the results of bigglm() against a glm() call. Consider the following
> code:
>
>
>> require(biglm)
>> options(digits=6, scipen=3, contrasts = c("contr.treatment", "contr.poly"))
>> dat=data.frame(y =c(rpois(50000, 10),rpois(50000, 15)), ttment=gl(2,50000))
>> m1 <- glm(y~ttment, data=dat, family=poisson(link="log"))
>> m1big <- bigglm(y~ttment , data=dat, family=poisson(link="log"))
>> summary(m1)
>
> <snipped output for this email>
> Coefficients:
> Estimate Std. Error z value Pr(>|z|)
> (Intercept) 2.30305 0.00141 1629 <2e-16 ***
> ttment2 0.40429 0.00183 221 <2e-16 ***
>
> Null deviance: 151889 on 99999 degrees of freedom
> Residual deviance: 101848 on 99998 degrees of freedom
> AIC: 533152
>
>> summary(m1big)
> Large data regression model: bigglm(y ~ ttment, data = dat, family =
> poisson(link = "log"))
> Sample size = 100000
> Coef (95% CI) SE p
> (Intercept) 2.651 2.650 2.653 0.001 0
> ttment2 4.346 4.344 4.348 0.001 0
>
>> m1big$deviance
> [1] 287158986
>
>
> Notice that the coefficients and deviance are quite different in the model
> estimated using bigglm(). If I change the chunk to seq(1000,10000,1000) the
> estimates remain the same.
>
> Can someone help me understand what is causing these differences?
>
> Here is my version info:
>
>> version
> _
> platform i386-pc-mingw32
> arch i386
> os mingw32
> system i386, mingw32
> status
> major 2
> minor 8.1
> year 2008
> month 12
> day 22
> svn rev 47281
> language R
> version.string R version 2.8.1 (2008-12-22)
>
>
> Many thanks in advance for your help,
>
> Francisco
>
> --
> Francisco J. Zagmutt
> Vose Consulting
> 2891 20th Street
> Boulder, CO, 80304
> USA
> www.voseconsulting.com
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
Thomas Lumley Assoc. Professor, Biostatistics
tlumley at u.washington.edu University of Washington, Seattle
More information about the R-help
mailing list