[R] Survey - twophase

Mon Jun 5 18:42:19 CEST 2006

On Mon, 5 Jun 2006, Mark Hempelmann wrote:

> Dear WizaRds,
>
>    I am struggling with the use of twophase in package survey. My goal
> is to compute a simple example in two phase sampling:
>
> phase 1: I sample n1=1000 circuit boards and find 80 non functional
> phase 2: Given the n1=1000 sample I sample n2=100 and find 15 non
> functional. Let's say, phase 2 shows this result together with phase 1:
>                    phase1
>                    ok   defunct   sum
> phase2 ok          85      0       85
>        defunct      5     10       15
> sum                90     10      100
>
> That is in R:
> fail <- data.frame(id=1:1000 , x=c(rep(0,920), rep(1,80)),
> y=c(rep(0,985), rep(1,15)), n1=rep(1000,1000), n2=rep(100,1000),
> N=rep(5000,1000))
>
> des.fail    <- twophase(id=list(~id,~id), data=fail, subset=~I(x==1))
> #    fpc=list(~n1,~n2)

The second-phase sample is described by subset=~I(x==1), so you have 
sampled only 80 in phase two, not 100.

> svymean(~y, des.fail)
>
> gives mean y 0.1875, SE 0.0196, but theoretically,
> we have x.bar1 (phase1)=0.08 and y.bar2 (phase2)=0.15 defect boards.

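That is 15/80 = 0.1875: in your data frame all 15 boards with y==1 are 
among the 80 boards with x==1, which is the subset you selected.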
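
Both numbers can be checked in base R with the data frame from the 
question. This is only a sketch of what the subset selects, not of what 
twophase() computes internally; the names phase2, n.phase2 and 
ybar.phase2 are made up for the illustration.

```r
# Data frame from the original message (n1, n2 and N are not needed here)
fail <- data.frame(id = 1:1000,
                   x  = c(rep(0, 920), rep(1, 80)),
                   y  = c(rep(0, 985), rep(1, 15)))

# subset = ~I(x==1) keeps only the rows with x == 1 as the phase-two sample
phase2 <- fail[fail$x == 1, ]

n.phase2    <- nrow(phase2)     # size of the phase-two sample
ybar.phase2 <- mean(phase2$y)   # unweighted mean of y within that sample

print(n.phase2)
print(ybar.phase2)
```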

> Two phase sampling assumes some relation between the easily/ fast
> received x-information and the elaborate/ time-consuming y-information,
> say a ratio r=sum y (phase2)/ sum x (phase2)=15/10=1.5 (out of the above
> table)

Not quite. Two-phase sampling is *useful* only where there is a 
relationship. No relationship is *assumed*.

There are two ways you can take advantage of a relationship. The first is 
to stratify the phase-two sampling based on phase one information.  In 
this case you need a strata= argument to twophase().
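
A minimal sketch of the stratified version, assuming the phase-two 
sample really is the 100 boards in the table and that it was stratified 
on x (90 of the 920 boards with x==0, 10 of the 80 with x==1). The 
indicator column in2 and the row positions used to fill it are 
hypothetical; they are chosen only to reproduce the table's counts.

```r
library(survey)

# Phase one: 1000 boards, x = quick test result (80 defunct)
fail <- data.frame(id = 1:1000, x = c(rep(0, 920), rep(1, 80)))

# Hypothetical phase-two indicator: 90 boards with x == 0, 10 with x == 1
fail$in2 <- FALSE
fail$in2[c(1:90, 921:930)] <- TRUE

# y is measured only in phase two, so it is NA everywhere else
fail$y <- NA_real_
fail$y[1:85]    <- 0   # x ok,      y ok
fail$y[86:90]   <- 1   # x ok,      y defunct
fail$y[921:930] <- 1   # x defunct, y defunct

# Phase two stratified on the phase-one variable x
des.strat <- twophase(id = list(~id, ~id), strata = list(NULL, ~x),
                      data = fail, subset = ~in2)
est.strat <- svymean(~y, des.strat)
print(est.strat)
```

By hand, the stratified estimate is (920*(5/90) + 80*(10/10))/1000, 
about 0.131, and the printed mean should agree with that if the 
phase-two sampling fractions are taken from the strata as expected.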

The second way to use a relationship is to calibrate phase two to phase 
one, using the calibrate() function.  This is analogous to the regression 
estimator you describe.
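
A sketch of the calibration approach, assuming instead that the 100 
phase-two boards were a simple random subsample of the 1000 and that x 
is known for every phase-one board. As before, the column in2 and the 
row positions are hypothetical and only reproduce the table's counts.

```r
library(survey)

# Phase one: 1000 boards, x = quick test result (80 defunct)
fail <- data.frame(id = 1:1000, x = c(rep(0, 920), rep(1, 80)))

# Hypothetical phase-two indicator for the 100 subsampled boards
fail$in2 <- FALSE
fail$in2[c(1:90, 921:930)] <- TRUE

# y is measured only in phase two, so it is NA everywhere else
fail$y <- NA_real_
fail$y[1:85]    <- 0   # x ok,      y ok
fail$y[86:90]   <- 1   # x ok,      y defunct
fail$y[921:930] <- 1   # x defunct, y defunct

# Unstratified phase two, then calibrate phase two to phase one on x
des.srs <- twophase(id = list(~id, ~id), data = fail, subset = ~in2)
des.cal <- calibrate(des.srs, phase = 2, formula = ~x)

est.cal <- svymean(~y, des.cal)
print(est.cal)
```

Because x is binary, calibrating on ~x amounts to post-stratifying on x, 
so the point estimate should again be about 0.131.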

A good example to look at is in vignette("epi").  This describes a 
two-phase sample where about 4000 people are in the first phase (a cancer 
clinical trial) and the second phase is then sampled based on relapse and 
on disease type ("histology") as determined at the local hospital.  
Disease type is determined more accurately at a central lab for everyone 
who relapses, everyone whose locally determined disease type is bad, and 
20% of the rest.
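
The inclusion rule just described can be sketched in base R on simulated 
data. Everything here is hypothetical: the cohort, the column names and 
the 15% and 10% rates are invented for the illustration and are not the 
vignette's data or code.

```r
set.seed(1)
n <- 4000

# Simulated phase-one cohort (rates are made up)
cohort <- data.frame(id        = 1:n,
                     relapse   = rbinom(n, 1, 0.15) == 1,
                     local.bad = rbinom(n, 1, 0.10) == 1)

# Sampled with certainty: relapses and bad locally determined disease type
certain <- cohort$relapse | cohort$local.bad

# Everyone else enters phase two with probability 0.2
cohort$in.phase2 <- certain | (runif(n) < 0.2)

print(table(certain = certain, in.phase2 = cohort$in.phase2))
```

An indicator such as in.phase2 is the kind of logical column that would 
be passed to the subset= argument of twophase().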

There is also an example of calibration, post-stratifying the second phase 
to the first phase on disease stage, for the same data.

Finally, note that twophase() does not use the unbiased estimator of 
variance. It uses a modification that is easier to compute for cluster 
samples, as described in vignette("phase1").  There is no difference if 
the first phase is sampled from an infinite population (or with 
replacement), which is the case in vignette("epi").

 	-thomas

Thomas Lumley			Assoc. Professor, Biostatistics
tlumley at u.washington.edu	University of Washington, Seattle