[R] Design-consistent variance estimate

Fri Aug 15 18:32:46 CEST 2008

Dear List:

(reposting due to word wrap problems, my apologies)

I am working to understand some differences between the results of the
svymean() function in the survey package and from code I have written
myself. The results from svymean() also agree with results I get from
SAS proc surveymeans, so, this suggests I am misunderstanding something.

I am never comfortable with "I did what the software" does mentality, so
I am working to fully replicate the methods to make sure I know what is
going on and then I can feel better about software. 

Below I provide three things for this conversation should anyone engage.
First I provide an example of R code where my code replicates svymean()
exactly. Second, I provide an example where the same code is used but
the results of my code and svymean() do not agree. Last, at the very
bottom of this is a minimal LaTeX code that you can compile to see my
derivation of the variance estimator using the taylor series expansion.
My code reflects this derivation.

### Below is an example where my code and svymean agree. 
### My code implements the taylor series reflected in the LaTeX below
### These data are balanced in that there are the same number of
individuals per cluster 
### Create correlated data
set.seed(100)
x <- rnorm(10)
y <- rep(10,10)
z <- rep(x,y)
score <- rnorm(100) + z
mm <- data.frame(group = gl(10,10), score)

### Use svymean to get results
gg <- svydesign(ids = ~group, data = mm) 
svymean(~score, gg, deff='replace', na.rm=TRUE) 

### My code which replicates the formulae in the LaTeX code below 
## Cluster totals 
xx <- with(mm, tapply(score, group, sum, na.rm=TRUE)) 
## Mean of the total 
ss <- mean(xx, na.rm=T) 
## get total variance 
k <- length(xx) 
totvar <- sum((xx-ss)^2) * k/(k-1) 
## The SE is then 
N <- nrow(mm)
sqrt(totvar/N^2)

If you rnu this you can see the results are the same between my code and
svymean().

One additional thing to point out is this:
## sqrt of variance of the total is: 
> svytotal(~score, gg)
        total     SE
score -1.7923 21.139

## sqrt of variance of total from my code
> sqrt(totvar)
[1] 21.13849

That is, the sqrt of the variance of the total from svytotal and my code
are the same.

### Now, the code below presents only the output as this is from 
### a data file I am working with.
### I can send data to anyone interested 
### It is the same code as above, just applied to a data
> gg <- svydesign(ids = ~CSRSSchoolCode, data = qq)
Warning message:
In svydesign.default(ids = ~CSRSSchoolCode, data = qq) :
  No weights or probabilities supplied, assuming equal probability
> svymean(~OSPI25632, gg, deff='replace', na.rm=TRUE)
             mean      SE   DEff
OSPI25632 0.84250 0.01072 1.0384
> 
> ### My code which replicates the formulae in the LaTeX code below 
## Cluster totals 
xx <- with(qq, tapply(OSPI25632, CSRSSchoolCode, sum, na.rm=TRUE)) 
## Mean of the total 
ss <- mean(xx, na.rm=T) 
## get total variance 
k <- length(xx) 
totvar <- sum((xx-ss)^2) * k/(k-1) 
## The SE is then 
N <- nrow(qq)
> sqrt(totvar/N^2)
[1] 0.02158459

Now, we can see the SE's are different. But, look at the following:

> svytotal(~OSPI25632, gg)
          total     SE
OSPI25632  1011 25.901
> sqrt(totvar)
[1] 25.90151

The sqrt of the variance of the total is the same as what my code
produces. But, because we get a different SE the only difference I think
is in the denominator, which is the N^2. 

Now, in my data the N size is:

> nrow(qq)
[1] 1200

But, we can do a little algebra and see what N size is being used by
svymean to get the SE as

# the numerator is the sqrt of the variance of the total from svytotal
and the denominator is 
# the se returned by svymean
25.901/.01072
[1] 2416.138

So, if we take 2416.138 as the N size I can replicate the svymean
standard error.
> 25.901/2416.138
[1] 0.01072

But, this is not the N size. The total N size is 1200 and the cluster
size is 519 in the observed data. So, as far as I can tell, it seems
svymean() is using a different N size that what I think the formula from
the taylor series expansion would dictate. Now, this is what I cannot
seem to reconcile. I am certain there is either another difference
somewhere or a good reason why this is occuring, I just can't seem to
get my head around the reason and would appreciate anyone with insight
into this problem. 

SessionInfo and LaTex are below.

Many thanks,
Harold

Below is the minimal LaTeX code for additional transparency.

\documentclass[12pt]{article}
\usepackage{bm,geometry}
\begin{document}

First define the ratio estimator of the mean as:

\begin{equation}
f(Y) = \frac{Y}{N}
\end{equation}

\noindent where $Y$ is the total of the variable and $N$ is the
population size. A first-order taylor series expansion of the ratio
estimator $f(Y)$ can be used to derive the design-consistent variance
estimator as:

\begin{equation}
var(f(Y))  =  \left[\frac{\partial f(Y)}{Y}\right]^2 var(Y) 
\end{equation}

\noindent where

\begin{equation}
\left[\frac{\partial f(Y)}{Y}\right] = \frac{1}{N} 
\end{equation}

\begin{equation}
var(Y) = \frac{k}{k-1} \sum_{j=1}^k(\hat{Y}_j-\hat{Y}_{..})^2
\end{equation}

\begin{equation}
\hat{Y}_j = \sum_{i=1}^{n_j}\hat{Y}_{j(i)} 
\end{equation}

\begin{equation}
\hat{Y}_{..} = k^{-1} \sum_{j=1}^k \hat{Y}_j 
\end{equation}

\noindent where $j$ indexes cluster $(1, 2, \ldots, k)$, $j(i)$ indexes
the $i$th member of cluster $j$, and $n_j$ is the total number of
members in cluster $j$. The standard error is then taken as:

\begin{equation}
se = \sqrt{\frac{var(f(Y))}{N^2}}
\end{equation}

\end{document}

> sessionInfo()
R version 2.7.1 (2008-06-23)
i386-pc-mingw32 

locale:
LC_COLLATE=English_United States.1252;LC_CTYPE=English_United
States.1252;LC_MONETARY=English_United
States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] survey_3.8

loaded via a namespace (and not attached):
[1] tools_2.7.1