[R] Chi-Square test and survey results

gheine at mathnmaps.com
Tue Oct 11 21:31:46 CEST 2011

An organization has asked me to comment on the validity of their
recent all-employee survey.  Survey responses, by geographic region, 
with the total number of employees in each region, were as follows:

> ByRegion
           All.Employees Survey.Respondents
Region_1            735                142
Region_2            500                 83
Region_3            897                 78
Region_4            717                133
Region_5            167                 48
Region_6            309                  0
Region_7            806                125
Region_8            627                122
Region_9            858                177
Region_10           851                160
Region_11           336                 52
Region_12          1823                312
Region_13            80                  9
Region_14           774                121
Region_15           561                 24
Region_16           834                134
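
For anyone who wants to reproduce the figures in this post, the table can be rebuilt roughly as follows. This is a minimal sketch: the counts are copied from the printout above, while the helper names `overall_rate` and `region_rate` are illustrative and not part of the original session.

```r
# Rebuild the table printed above as a data frame with region row names.
ByRegion <- data.frame(
  All.Employees = c(735, 500, 897, 717, 167, 309, 806, 627,
                    858, 851, 336, 1823, 80, 774, 561, 834),
  Survey.Respondents = c(142, 83, 78, 133, 48, 0, 125, 122,
                         177, 160, 52, 312, 9, 121, 24, 134),
  row.names = paste0("Region_", 1:16)
)

# Overall and per-region response rates (respondents / employees).
overall_rate <- sum(ByRegion$Survey.Respondents) / sum(ByRegion$All.Employees)
region_rate  <- ByRegion$Survey.Respondents / ByRegion$All.Employees
names(region_rate) <- rownames(ByRegion)

round(overall_rate, 3)
round(region_rate, 3)
```

Comparing each regional rate with the overall rate gives a quick first look at which regions are over- or under-represented before any test is run.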

How well does the survey represent the employee population?
The chi-square test says, not very well:

> chisq.test(ByRegion)

         Pearson's Chi-squared test

data:  ByRegion
X-squared = 163.6869, df = 15, p-value < 2.2e-16

By striking three under-represented regions (3, 6, and 15), we get
a more reasonable, although still not convincing, result:

> chisq.test(ByRegion[setdiff(1:16,c(3,6,15)),])

         Pearson's Chi-squared test

data:  ByRegion[setdiff(1:16, c(3, 6, 15)), ]
X-squared = 22.5643, df = 12, p-value = 0.03166
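
One way to see which regions drive the statistic, rather than picking them by eye, is to look at the Pearson residuals that `chisq.test` returns. The sketch below assumes the table has been rebuilt from the printout above; the names `res`, `resid_resp`, `dropped`, and `res_trim` are illustrative.

```r
# Counts copied from the table in the post.
ByRegion <- data.frame(
  All.Employees = c(735, 500, 897, 717, 167, 309, 806, 627,
                    858, 851, 336, 1823, 80, 774, 561, 834),
  Survey.Respondents = c(142, 83, 78, 133, 48, 0, 125, 122,
                         177, 160, 52, 312, 9, 121, 24, 134),
  row.names = paste0("Region_", 1:16)
)

res <- chisq.test(ByRegion)

# Pearson residuals: (observed - expected) / sqrt(expected), one per cell.
# Large negative values in the respondent column mark under-represented regions.
resid_resp <- res$residuals[, "Survey.Respondents"]
head(sort(resid_resp), 3)

# Same test with the three most under-represented regions removed,
# mirroring the setdiff() call shown above.
dropped  <- c(3, 6, 15)
res_trim <- chisq.test(ByRegion[setdiff(1:16, dropped), ])
res_trim
```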

This poses several questions:

1)  Looking at a side-by-side barchart (proportion of responses vs.
proportion of employees, per region), the pattern of survey responses
appears, visually, to match the pattern of employees fairly well.  Is
this a case where we trust the numbers and not the picture?
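
The side-by-side barchart described in question 1 could be drawn along these lines. This is only a sketch using base graphics; the post does not say how the original chart was produced, and the object name `props` is an assumption.

```r
# Counts copied from the table in the post.
ByRegion <- data.frame(
  All.Employees = c(735, 500, 897, 717, 167, 309, 806, 627,
                    858, 851, 336, 1823, 80, 774, 561, 834),
  Survey.Respondents = c(142, 83, 78, 133, 48, 0, 125, 122,
                         177, 160, 52, 312, 9, 121, 24, 134),
  row.names = paste0("Region_", 1:16)
)

# Each row holds one distribution over regions and sums to 1.
props <- rbind(
  Employees   = ByRegion$All.Employees / sum(ByRegion$All.Employees),
  Respondents = ByRegion$Survey.Respondents / sum(ByRegion$Survey.Respondents)
)
colnames(props) <- rownames(ByRegion)

# beside = TRUE places the two bars for each region next to each other.
barplot(props, beside = TRUE, las = 2, legend.text = TRUE,
        ylab = "Proportion of total")
```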

2) Part of the problem, ironically, is that there were too many responses
to the survey.  If we had only one-tenth the responses, but in the same
proportions by region, the chi-square statistic would look much better
(though with a warning about possible inaccuracy):

data:  data.frame(ByRegion$All.Employees, 0.1 * 
X-squared = 17.5912, df = 15, p-value = 0.2848

Is there a way of reconciling a large response rate with an 
response profile?  Or is the bad news that the survey will give very 
results about a very ill-specified sub-population?

(Of course, I would put it in softer terms, like "you need to assess the 
of homogeneity across different regions".)
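
The effect of sample size described in question 2 can be checked with a sketch like the one below. The call printed in the post is cut off, so scaling the respondent column by 0.1 is an assumption based on the wording "one-tenth the responses"; the names `tenth`, `full`, and `small` are illustrative.

```r
# Counts copied from the table in the post.
ByRegion <- data.frame(
  All.Employees = c(735, 500, 897, 717, 167, 309, 806, 627,
                    858, 851, 336, 1823, 80, 774, 561, 834),
  Survey.Respondents = c(142, 83, 78, 133, 48, 0, 125, 122,
                         177, 160, 52, 312, 9, 121, 24, 134),
  row.names = paste0("Region_", 1:16)
)

full <- chisq.test(ByRegion)

# Assumed scenario: same regional proportions, one-tenth as many responses.
tenth <- data.frame(
  All.Employees      = ByRegion$All.Employees,
  Survey.Respondents = 0.1 * ByRegion$Survey.Respondents
)

# Some expected counts fall below 5 here, so chisq.test() warns that the
# chi-squared approximation may be incorrect; the warning is suppressed
# only to keep the output short.
small <- suppressWarnings(chisq.test(tenth))

c(full = unname(full$statistic), tenth = unname(small$statistic))
c(full = full$p.value, tenth = small$p.value)
```

The regional proportions are identical in both tables, so the difference between the two statistics comes from the number of responses alone.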

3) Is Chi-squared really the right measure of how representative the survey is?
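
As background to question 3: the respondents are a subset of the employees, so the two columns of `ByRegion` are not independent samples. Two alternative framings are sketched below, offered as possibilities rather than as the answer: a goodness-of-fit test of the respondent counts against the employee proportions, and a test of equal response rates on a respondent / non-respondent table. The names `gof`, `resp_tab`, and `homog` are illustrative.

```r
# Counts copied from the table in the post.
ByRegion <- data.frame(
  All.Employees = c(735, 500, 897, 717, 167, 309, 806, 627,
                    858, 851, 336, 1823, 80, 774, 561, 834),
  Survey.Respondents = c(142, 83, 78, 133, 48, 0, 125, 122,
                         177, 160, 52, 312, 9, 121, 24, 134),
  row.names = paste0("Region_", 1:16)
)

# Framing A: goodness of fit. Do respondent counts follow the employee
# distribution? rescale.p = TRUE turns the employee counts into proportions.
gof <- chisq.test(x = ByRegion$Survey.Respondents,
                  p = ByRegion$All.Employees, rescale.p = TRUE)

# Framing B: split each region into respondents and non-respondents, then
# test whether the response rate is the same in every region.
resp_tab <- cbind(
  Responded       = ByRegion$Survey.Respondents,
  Did.Not.Respond = ByRegion$All.Employees - ByRegion$Survey.Respondents
)
rownames(resp_tab) <- rownames(ByRegion)
homog <- chisq.test(resp_tab)

gof
homog
```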

<<<<<<< >>>>>>>>>

Thanks for any help you can give - hope these questions make sense -

George H.
