[R] How to do a time-stratified case-crossover analysis for air pollution data? Unformatted text-version, with an additional note
Nilsson Fredrik X
Fredrik.X.Nilsson at skane.se
Fri Mar 7 16:08:13 CET 2008
Dear Experts,
I am trying to do a time-stratified case-crossover analysis on air pollution data and number of myocardial infarctions. In order to avoid model selection bias, I started with a simple simulation.
I'm still not sure if my simulation is right. But the results I get from the "ts-case-crossover" are much more variable than those from a glm.
Is this:
a. Due to the simple relation of log-rate of mi cases being = alpha + beta*pm that the glm results are more precise?
b. Due to me using method="approximate" instead of default "exact" in the clogit?
c. Due to the fact that the "approximate" method in clogit use "breslow" for handling ties and not "efron"?
d. Due to that I've misunderstood how to arrange data when doing a case-crossover, such that there is a way of using "exact", that wouldn't go berserk due to the many ties (see #Berserk script at bottom).
e. The prize to pay for using a case-crossover analysis?
f. Or, are my simulation results for glm overly precise due to the positive autocorrelation (changed to -0.9 which made the time-stratifed be more precise, but not the glm). This note was not in the HTML-version. Sorry for that.
It should perhaps be noted that, because of the absence of individual data, the exposure to air-pollution (pm10) is assumed to be common to all individuals on a certain day.
I'd be most grateful for any help and ideas on this matter.
Best regards,
Fredrik Nilsson, PhD
PS. I am aware of the limitations that Whitaker et al. presented in Environmetrics 2007; 18: 157-171, but tried to use the time-stratified case-crossover as it could be a "simple", but fairly correct way of doing this type of analysis.
Simulation script:
library(survival)
n<-2*365
samp<-100
ti<-1:n
ar1<-rnorm(1)
for (i in 2:n){
ar1[i]<-0.2*ar1[i-1]+rnorm(1)
}
#old version of air pollution
#pm10<-2 + .5*sin(2*pi*ti/365 + 127) + 0.1*ar1
startdate<-"1992-07-01"
date1<-as.Date(startdate)
dates<-date1 + (ti-1)
tyda<-weekdays(dates)
tymo<-months(dates)
tyyear<-as.character(dates)
for (i in 1:n)
{
ask<-tyyear[i]
tyyear[i]<-substr(ask,1,4)
}
moeff<-cumsum(c(1,31,28,31,30,31,30,31,31,30,31,30))
a<-as.Date("1994-12-31")
monthnames<-months(a+moeff)
tymofa<-match(tymo,monthnames)
moeff<-.5*sin(2*pi*moeff/365 + 127)
#new version; to have strictly stationary levels of air-pollution
#within strata
pm10<-2 + .5*moeff[tymofa] + 0.1*ar1
#intercept and proportionality to pm10 coefficients of log-rate
al<- -2.5
be<- 1.4
qres<-numeric(samp)
glmres<-qres
gamres<-qres
for(q in 1:samp)
{
rate<-exp(al + be*pm10)
mi<-rpois(length(rate),lambda=rate)
date1<-as.Date(startdate)
dates<-date1 + (ti-1)
tyda<-weekdays(dates)
tymo<-months(dates)
tyyear<-as.character(dates)
for (i in 1:n)
{
ask<-tyyear[i]
tyyear[i]<-substr(ask,1,4)
}
#replicate cases
air<-data.frame(mi, tyda, tymo, tyyear, pm10, dates)
air$stratn<-as.numeric(strata(air$tyda, air$tymo,air$tyyear))
lest<-unique(air$stratn)
air$nctrl<-0
airbase<-air
#find the number of controls whithin each stratum
for (i in 1:length(lest))
{
a<-which(air$stratn==lest[i])
for (j in 1:length(a))
{
air$nctrl[a[j]]<-sum(air$mi[a[-j]])
}
}
#create cases and controls
cami<-rep(1,sum(mi))
ctmi<-rep(0,sum(air$nctrl))
capm<-rep(pm10, mi)
ctpm<-rep(pm10, air$nctrl)
cast<-rep(air$stratn, mi)
ctst<-rep(air$stratn, air$nctrl)
cady<-rep(dates, mi)
ctdy<-rep(dates, air$nctrl)
cases<-c(cami, ctmi)
stranu<-c(cast, ctst)
days<-c(cady, ctdy)
pmva<-c(capm, ctpm)
air2<-data.frame(cases, days, pmva,stranu)
air.cl<-clogit(cases~pmva + strata(stranu), method="approximate", data=air2)
air.glm<-glm(mi~pm10, family=poisson, data=air)
#air.gam<-gam(mi~s(pm10), family=poisson)
qres[q]<-as.numeric(coef(air.cl))
glmres[q]<-as.numeric(coef(air.glm)[2])
#gamres[q]<-as.numeric(coef(air.gam)[2])
}
par(mfrow=c(2,1))
plot(density(qres))
plot(density(glmres))
fivenum(qres)
fivenum(glmres)
# Berserk script
# this shows that even for very few cases, my intended way of doing TS C-C is quite time-consuming.
#
#case<-c(5,0,2,1)
#airq<-c(10,2,3,3)
#
#nctrl<-0*case
#for (i in 1:length(case))
#{
# nctrl[i]<-sum(case[-i])
#}
#
#airca<-rep(airq,case)
#airct<-rep(airq,nctrl)
#
#cases<-rep(1,sum(case))
#ctrls<-rep(0,sum(nctrl))
#
#mi<-c(cases,ctrls)
#airq<-c(airca,airct)
#a<-Sys.time()
#air.cl<-clogit(mi~airq)
#b<-Sys.time()
#b-a
#a<-Sys.time()
#air.cl<-clogit(mi~airq, method="approximate")
#b<-Sys.time()
#b-a
More information about the R-help
mailing list