[R] Appropriate test for overdispersion in binomial data
Chris Oosthuizen
wcoosthuizen at zoology.up.ac.za
Thu Feb 18 02:12:34 CET 2010
Dear R users,
Overdispersion is often a problem in binomial data. I am modelling a
binary response (sex ratio) with three categorical explanatory variables,
using a GLM of the following form:
y <- cbind(sexf, sample - sexf)
model <- glm(y ~ age + month + year, family = binomial)
summary(model)
Output:
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 8956.7 on 582 degrees of freedom
Residual deviance: 4111.9 on 555 degrees of freedom
AIC: 6735.2
Following MJ Crawley (The R Book 2007) this model can be updated to:
model2 <- glm(y ~ age + month + year, family = quasibinomial)
summary(model2)
Output:
(Dispersion parameter for quasibinomial family taken to be 7.080681)
Null deviance: 8956.7 on 582 degrees of freedom
Residual deviance: 4111.9 on 555 degrees of freedom
AIC: NA
As far as I can tell, R users (on the Forum) and MJ Crawley assess the
degree of overdispersion in binomial data from the residual deviance:
without overdispersion, the residual scaled deviance should be roughly
equal to the residual degrees of freedom (here it is 4111.9 on 555).
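For illustration, here is a minimal sketch of the two ratios that are usually compared against 1, run on simulated grouped data rather than the data above. The variable names (`succ`, `size`, `grp`) and the beta-binomial style simulation are assumptions made only for this example. Note that `summary()` of a quasibinomial fit reports the Pearson chi-square divided by the residual df, not the deviance ratio, which would be consistent with the 7.08 above differing from 4111.9 / 555 ≈ 7.41.

```r
set.seed(1)

# Hypothetical grouped data: 200 groups of 30 trials, one 4-level factor
n_groups <- 200
size <- rep(30, n_groups)
grp <- gl(4, 50)

# Extra-binomial variation: each group's probability is drawn from a beta
# distribution centred on the mean for its factor level
p_mean <- plogis(c(-0.5, 0, 0.3, 0.8))[as.integer(grp)]
p_grp <- rbeta(n_groups, p_mean * 5, (1 - p_mean) * 5)
succ <- rbinom(n_groups, size, p_grp)

fit_b <- glm(cbind(succ, size - succ) ~ grp, family = binomial)
fit_q <- glm(cbind(succ, size - succ) ~ grp, family = quasibinomial)

# Ratio 1: residual deviance / residual df
dev_ratio <- deviance(fit_b) / df.residual(fit_b)

# Ratio 2: Pearson chi-square / residual df
pearson_ratio <- sum(residuals(fit_b, type = "pearson")^2) / df.residual(fit_b)

# Dispersion reported by summary() for the quasibinomial fit
quasi_disp <- summary(fit_q)$dispersion

print(c(deviance = dev_ratio, pearson = pearson_ratio, quasi = quasi_disp))
```

In this sketch the Pearson ratio and the quasibinomial dispersion coincide, while the deviance ratio is similar but not identical.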
However, please read the following comment, which I copied from the thread
"Under dispersion; Was: [R] binomial glm warnings revisited", posted in
2003 by Peter Dalgaard:
"Don't trust deviances as measures of dispersion with binary data!" and
"With binary data, the deviance is purely a function of the fitted
parameters. It is the difference in -2 log L between a "perfect fit"
and the observed fit. A perfect fit has a zero prob. where the obs is
"0" and probability 1 where it is "1", and L == 1 identically in that
case. Now consider the likelihood for the "complete toss-up" i.e.
intercept and slope both equal to 0 so all probabilities are 0.5. The
likelihood in that case is 0.5^269, i.e. a constant. Take logarithms
and notice that the model deviance plus the change in deviance from
the model to the "toss-up" model is constant (2*269*log(2) to be
precise). So what appears to be a measure of residual error is
really just a measure of how far the fitted probabilities are from
0.5!"
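The quoted argument can be checked numerically. The following is a rough sketch on simulated ungrouped 0/1 data, assuming a logistic regression with an intercept and one covariate; the names `x`, `y` and `fit` are hypothetical, and only n = 269 is taken from the quote. It shows that the residual deviance equals -2 log L, that at the fitted values it can be computed from the fitted probabilities alone, and that the "toss-up" deviance is the constant 2*269*log(2).

```r
set.seed(2)

n <- 269                                 # count used in the quoted example
x <- rnorm(n)                            # hypothetical covariate
y <- rbinom(n, 1, plogis(0.5 + x))       # ungrouped 0/1 responses

fit <- glm(y ~ x, family = binomial)
p_hat <- fitted(fit)

# The "perfect fit" has log-likelihood 0, so the deviance is just -2 log L
dev_manual <- -2 * sum(y * log(p_hat) + (1 - y) * log(1 - p_hat))

# At the maximum likelihood estimate (logit link, intercept included) the
# observed y can be replaced by p_hat: the deviance depends only on the fit
dev_fitted_only <- -2 * sum(p_hat * log(p_hat) + (1 - p_hat) * log(1 - p_hat))

# "Toss-up" model: all probabilities fixed at 0.5
dev_tossup <- -2 * n * log(0.5)          # equals 2 * n * log(2)

print(c(glm = deviance(fit), manual = dev_manual,
        fitted_only = dev_fitted_only, tossup = dev_tossup))
```

The first three values should agree up to the convergence tolerance of `glm()`, which is the sense in which the deviance here measures how far the fitted probabilities are from 0.5 rather than residual error.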
My questions are:
1) Is residual deviance / df an appropriate measure of dispersion for
binary data? (it seems to be widely used)
2) If I understand P. Dalgaard's comment correctly, and it is not, what is
the appropriate way?
Many thanks to all who have asked and answered questions in the past - it
is of great assistance.
Chris
--
W.C.Oosthuizen
Mammal Research Institute
Department of Zoology & Entomology
University of Pretoria
Pretoria
South Africa