[R] glmer with non integer weights

Sun Apr 18 11:29:29 CEST 2010

Le vendredi 16 avril 2010 à 00:15 -0800, Kay Cichini a écrit :
> thanks thierry,
> 
> i considered this transformations already, but variance is not stabilized
> and/or normality is neither achieved.
> i guess i'll have to look out for non-parametrics?

Or (maybe) a model based on a non-Gaussian likelihood ? A beta
distribution comes to mind, either fitted by maximum likelihood or (if
relevant prior information is available) in a Bayesian framework ?

But beware : you have a not-so-small problem ...

Your data have zeroes and ones, which, if you have no information on a
"sample size", are "sharp" zeroes and ones, and there therefore
theoretically bound to infinite linear predictors (in plain English :
bloody unlikely). These values make a "fixed effect" analysis
impossible : these points "at infinite" will make regression essentially
impossible. Consider :

> logit<-function(x)log(x/(1-x))
> ilogit<-function(x)1/(1+exp(-x))
> ilogit(coef(lm(logit(MH.Index)~0+stage,data=similarity)))
Erreur dans lm.fit(x, y, offset = offset, singular.ok =
singular.ok, ...) : 
  NA/NaN/Inf dans un appel à une fonction externe (argument 4)

You need to have some information on the precision of your zeroes and
ones nd use it to get some "possible" values of MH.Index

One might be tempted to "unround" them by a small amount (representing a
"reasonable guess" on your precision :
> epsilon=0.01
> ilogit(coef(lm(logit(MH.Index)~0
+stage,data=within(similarity,{MH.Index<-pmax(epsilon,pmin(1-epsilon,MH.Index))}))))
   stageA    stageB    stageC    stageD 
0.6490997 0.2914323 0.5087639 0.5721789 

BUT :

> epsilon=0.000001
> ilogit(coef(lm(logit(MH.Index)~0
+stage,data=within(similarity,{MH.Index<-pmax(epsilon,pmin(1-epsilon,MH.Index))}))))
   stageA    stageB    stageC    stageD 
0.6588222 0.1020177 0.5087639 0.5721789 

The estimation for stageB depends critically of the "unrounding" amount
chosen.

I tried a small fixed effect logit model in JAGS : it won't initialize
with the original data (O and 1 are effectively impossible with possible
values of the beta coefficients), and seems to exhibit, the same kind of
sensitivity to "unrounding" amount that the linear model :

LogitModFix<-local({
  Modele<-function(){
    for(k in 1:nobs) {
      ## logit(MH.Index[k])<-lpi.i[k]
      lpi.i[k]~dnorm(lpi[stage[k]], tau.lpi[stage[k]])
    }
    for(i in 1:nstage) {
      lpi[i]~dnorm(0,1.0E-6)
      tau.lpi[i]<-pow(sigma.lpi[i],-2)
      sigma.lpi[i]~dunif(0,100)
      pi[i]<-ilogit(lpi[i])
    }
  }
  Data<-function() {
    for (i in 1:nobs) {
      lpi.i[i]<-logit(MH.Index[i])
    }
  }
  tmpf<-tempfile()
  ## write.model has been shoplifted from R2WinBUGS and adapted to JAGS
  ## by allowing a "data" argument for transformations
  write.model(Modele,tmpf, data=Data)
  epsilon<-0.00001
  Modele.jags<-jags.model(tmpf,
                          data=with(similarity,

list(MH.Index=pmax(epsilon,pmin(1-epsilon,MH.Index)),
                                 ## MH.Index=MH.Index,
                                 stage=stage,
                                 nobs=nrow(similarity),
                                 nstage=nlevels(stage))),
                          n.chains=3)
  unlink(tmpf)
  Modele.coda<-coda.samples(Modele.jags,
                            variable.names=c("deviance", "pi",
"sigma.lpi"),
                            n.iter=1000)
  list(Modele.jags, Modele.coda)
})

## Convergence (not shown) is quite acceptable

> summary(LogitModFix[[2]])

Iterations = 1001:2000
Thinning interval = 1 
Number of chains = 3 
Sample size per chain = 1000 

1. Empirical mean and standard deviation for each variable,
   plus standard error of the mean:

                 Mean      SD  Naive SE Time-series SE
deviance     660.5656 4.71697 0.0861196      0.1672079
pi[1]          0.6304 0.21666 0.0039557      0.0040347
pi[2]          0.1506 0.08036 0.0014671      0.0014460
pi[3]          0.5082 0.03956 0.0007222      0.0006777
pi[4]          0.5696 0.07473 0.0013644      0.0012952
sigma.lpi[1]   6.9409 0.86442 0.0157821      0.0287626
sigma.lpi[2]   4.2775 0.49281 0.0089974      0.0132247
sigma.lpi[3]   1.0875 0.11506 0.0021006      0.0026048
sigma.lpi[4]   0.8801 0.29371 0.0053623      0.0118630

2. Quantiles for each variable:

                 2.5%       25%      50%      75%    97.5%
deviance     654.0216 657.13069 659.6788 662.9535 672.2101
pi[1]          0.1695   0.47600   0.6681   0.8107   0.9485
pi[2]          0.0413   0.09206   0.1345   0.1914   0.3564
pi[3]          0.4320   0.48079   0.5078   0.5366   0.5851
pi[4]          0.4154   0.52518   0.5722   0.6168   0.7135
sigma.lpi[1]   5.5553   6.33366   6.8287   7.4412   8.8207
sigma.lpi[2]   3.5082   3.93440   4.2274   4.5598   5.3497
sigma.lpi[3]   0.8833   1.00397   1.0805   1.1565   1.3412
sigma.lpi[4]   0.5304   0.67740   0.8189   1.0067   1.5987

The same model re-fitted with epsilon=0.01 gives :

> summary(LogitModFix[[2]])

Iterations = 1001:2000
Thinning interval = 1 
Number of chains = 3 
Sample size per chain = 1000 

1. Empirical mean and standard deviation for each variable,
   plus standard error of the mean:

                 Mean      SD  Naive SE Time-series SE
deviance     534.2639 4.42164 0.0807278       0.122396
pi[1]          0.6396 0.10877 0.0019859       0.001892
pi[2]          0.2964 0.06592 0.0012036       0.001163
pi[3]          0.5083 0.04120 0.0007522       0.000805
pi[4]          0.5702 0.07170 0.0013091       0.001221
sigma.lpi[1]   3.0387 0.36356 0.0066376       0.008803
sigma.lpi[2]   2.0519 0.22665 0.0041381       0.005358
sigma.lpi[3]   1.0928 0.12531 0.0022879       0.003305
sigma.lpi[4]   0.8699 0.26650 0.0048656       0.010146

2. Quantiles for each variable:

                 2.5%      25%      50%      75%    97.5%
deviance     527.9142 530.9711 533.4347 536.6295 544.7738
pi[1]          0.4130   0.5683   0.6429   0.7195   0.8251
pi[2]          0.1826   0.2511   0.2909   0.3373   0.4390
pi[3]          0.4266   0.4809   0.5093   0.5359   0.5892
pi[4]          0.4183   0.5254   0.5701   0.6150   0.7106
sigma.lpi[1]   2.4362   2.7772   2.9965   3.2600   3.8395
sigma.lpi[2]   1.6606   1.8935   2.0320   2.1811   2.5598
sigma.lpi[3]   0.8803   1.0035   1.0754   1.1711   1.3670
sigma.lpi[4]   0.5022   0.6901   0.8198   0.9906   1.5235

One should also note that the deviance is *extremely* sensitive to the
choice of epsilon, whic tends to indicate that, at small epsilon values,
the "sharp" "impossible" values dominate the evaluation of the
oefficients they are involved in.

Beta models (not shown) give similar results, to a lesser extend.

In short, your "sharp" data are essentially incompatible with your
model, which can't be fitted unless you get at least a rough idea of
their precision.

HTH,

				Emmanuel Charpentier, DDS, MSc

> best regards,
> kay