[R] Design-consistent variance estimate
Doran, Harold
HDoran at air.org
Fri Aug 15 18:32:46 CEST 2008
Dear List:
(reposting due to word wrap problems, my apologies)
I am working to understand some differences between the results of the
svymean() function in the survey package and from code I have written
myself. The results from svymean() also agree with results I get from
SAS proc surveymeans, so, this suggests I am misunderstanding something.
I am never comfortable with "I did what the software" does mentality, so
I am working to fully replicate the methods to make sure I know what is
going on and then I can feel better about software.
Below I provide three things for this conversation should anyone engage.
First I provide an example of R code where my code replicates svymean()
exactly. Second, I provide an example where the same code is used but
the results of my code and svymean() do not agree. Last, at the very
bottom of this is a minimal LaTeX code that you can compile to see my
derivation of the variance estimator using the taylor series expansion.
My code reflects this derivation.
### Below is an example where my code and svymean agree.
### My code implements the taylor series reflected in the LaTeX below
### These data are balanced in that there are the same number of
individuals per cluster
### Create correlated data
set.seed(100)
x <- rnorm(10)
y <- rep(10,10)
z <- rep(x,y)
score <- rnorm(100) + z
mm <- data.frame(group = gl(10,10), score)
### Use svymean to get results
gg <- svydesign(ids = ~group, data = mm)
svymean(~score, gg, deff='replace', na.rm=TRUE)
### My code which replicates the formulae in the LaTeX code below
## Cluster totals
xx <- with(mm, tapply(score, group, sum, na.rm=TRUE))
## Mean of the total
ss <- mean(xx, na.rm=T)
## get total variance
k <- length(xx)
totvar <- sum((xx-ss)^2) * k/(k-1)
## The SE is then
N <- nrow(mm)
sqrt(totvar/N^2)
If you rnu this you can see the results are the same between my code and
svymean().
One additional thing to point out is this:
## sqrt of variance of the total is:
> svytotal(~score, gg)
total SE
score -1.7923 21.139
## sqrt of variance of total from my code
> sqrt(totvar)
[1] 21.13849
That is, the sqrt of the variance of the total from svytotal and my code
are the same.
### Now, the code below presents only the output as this is from
### a data file I am working with.
### I can send data to anyone interested
### It is the same code as above, just applied to a data
> gg <- svydesign(ids = ~CSRSSchoolCode, data = qq)
Warning message:
In svydesign.default(ids = ~CSRSSchoolCode, data = qq) :
No weights or probabilities supplied, assuming equal probability
> svymean(~OSPI25632, gg, deff='replace', na.rm=TRUE)
mean SE DEff
OSPI25632 0.84250 0.01072 1.0384
>
> ### My code which replicates the formulae in the LaTeX code below
## Cluster totals
xx <- with(qq, tapply(OSPI25632, CSRSSchoolCode, sum, na.rm=TRUE))
## Mean of the total
ss <- mean(xx, na.rm=T)
## get total variance
k <- length(xx)
totvar <- sum((xx-ss)^2) * k/(k-1)
## The SE is then
N <- nrow(qq)
> sqrt(totvar/N^2)
[1] 0.02158459
Now, we can see the SE's are different. But, look at the following:
> svytotal(~OSPI25632, gg)
total SE
OSPI25632 1011 25.901
> sqrt(totvar)
[1] 25.90151
The sqrt of the variance of the total is the same as what my code
produces. But, because we get a different SE the only difference I think
is in the denominator, which is the N^2.
Now, in my data the N size is:
> nrow(qq)
[1] 1200
But, we can do a little algebra and see what N size is being used by
svymean to get the SE as
# the numerator is the sqrt of the variance of the total from svytotal
and the denominator is
# the se returned by svymean
25.901/.01072
[1] 2416.138
So, if we take 2416.138 as the N size I can replicate the svymean
standard error.
> 25.901/2416.138
[1] 0.01072
But, this is not the N size. The total N size is 1200 and the cluster
size is 519 in the observed data. So, as far as I can tell, it seems
svymean() is using a different N size that what I think the formula from
the taylor series expansion would dictate. Now, this is what I cannot
seem to reconcile. I am certain there is either another difference
somewhere or a good reason why this is occuring, I just can't seem to
get my head around the reason and would appreciate anyone with insight
into this problem.
SessionInfo and LaTex are below.
Many thanks,
Harold
Below is the minimal LaTeX code for additional transparency.
\documentclass[12pt]{article}
\usepackage{bm,geometry}
\begin{document}
First define the ratio estimator of the mean as:
\begin{equation}
f(Y) = \frac{Y}{N}
\end{equation}
\noindent where $Y$ is the total of the variable and $N$ is the
population size. A first-order taylor series expansion of the ratio
estimator $f(Y)$ can be used to derive the design-consistent variance
estimator as:
\begin{equation}
var(f(Y)) = \left[\frac{\partial f(Y)}{Y}\right]^2 var(Y)
\end{equation}
\noindent where
\begin{equation}
\left[\frac{\partial f(Y)}{Y}\right] = \frac{1}{N}
\end{equation}
\begin{equation}
var(Y) = \frac{k}{k-1} \sum_{j=1}^k(\hat{Y}_j-\hat{Y}_{..})^2
\end{equation}
\begin{equation}
\hat{Y}_j = \sum_{i=1}^{n_j}\hat{Y}_{j(i)}
\end{equation}
\begin{equation}
\hat{Y}_{..} = k^{-1} \sum_{j=1}^k \hat{Y}_j
\end{equation}
\noindent where $j$ indexes cluster $(1, 2, \ldots, k)$, $j(i)$ indexes
the $i$th member of cluster $j$, and $n_j$ is the total number of
members in cluster $j$. The standard error is then taken as:
\begin{equation}
se = \sqrt{\frac{var(f(Y))}{N^2}}
\end{equation}
\end{document}
> sessionInfo()
R version 2.7.1 (2008-06-23)
i386-pc-mingw32
locale:
LC_COLLATE=English_United States.1252;LC_CTYPE=English_United
States.1252;LC_MONETARY=English_United
States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] survey_3.8
loaded via a namespace (and not attached):
[1] tools_2.7.1
More information about the R-help
mailing list