[R] Unexpected scores from weighted PCA with svyprcomp()

Leonardo Ferreira Fontenelle leonardof at leonardof.med.br
Sat Apr 30 04:40:11 CEST 2016


Hello!

I'd like to create an assets-based economic indicator using data from a
national household survey. The economic indicator is to be the first
principal component from a principal components analysis, which (given
the source of the data) I believe should take in consideration the
sampling weights of the observations. After running the PCA with
svyprcomp(), from the survey package, I wanted to list the loading (with
regard to the first principal component) and the scale of the variables,
so that I can tell people how to "reconstitute" the economic indicator
from the variables without any knowledge of PCA. This reconstituted
indicator wouldn't be centered, but that's OK because the important
thing for the application is the relative position of the observations.
The unexpected (at least for me) behavior was that the principal
component returned by svyprcomp() was very different from from the
reconstituted indicator as well as from the indicator returned by
predict(). "Different" here means weak correlation and different
distributions.

I hope the following code illustrates what I mean:

=====

svycor <- function(formula, design) {
  # https://stat.ethz.ch/pipermail/r-help/2003-July/036645.html
  stopifnot(require(survey))
  covariance.matrix <- svyvar(formula, design)
  variables <- diag(covariance.matrix)
  correlation.matrix <- covariance.matrix / sqrt(variables %*%
  t(variables))
  return(correlation.matrix)
}

library(survey)
data(api)
dclus2 <- svydesign(ids = ~ dnum + snum, fpc = ~ fpc1 + fpc2, data =
apiclus2)
pc <- svyprcomp( ~ api99 + api00, design = dclus2, scale = TRUE, scores
= TRUE)
dclus2$variables$pc1 <- pc$x[, "PC1"]
dclus2$variables$pc2 <- predict(pc, apiclus2)[, "PC1"]
mycoef <- pc$rotation[, "PC1"] / pc$scale
dclus2$variables$pc3 <- with(apiclus2, api99 * mycoef["api99"] + api00 *
mycoef["api00"])
svycor(~ pc1 + pc2 + pc3, dclus2)[, ]
#           pc1       pc2       pc3
# pc1 1.0000000 0.2078789 0.2078789
# pc2 0.2078789 1.0000000 1.0000000
# pc3 0.2078789 1.0000000 1.0000000
plot(svysmooth(~ pc1, dclus2), xlim = c(-2.5, 5), ylim = 0:1)
lines(svysmooth(~ pc2, dclus2), col = 2)
lines(svysmooth(~ pc3, dclus2), col = 3)
legend("topright", legend = c('pc$x[, "PC1"]', 'predict(pc, apiclus2)[,
"PC1"]', 'Reconstituted indicator'), col = 1:3, lty = 1)

sessionInfo()
# R version 3.2.4 Revised (2016-03-16 r70336)
# Platform: x86_64-pc-linux-gnu (64-bit)
# Running under: Arch Linux
# 
#  locale:
#  [1] LC_CTYPE=pt_BR.UTF-8       LC_NUMERIC=C              
#  [3] LC_TIME=pt_BR.UTF-8        LC_COLLATE=pt_BR.UTF-8    
#  [5] LC_MONETARY=pt_BR.UTF-8    LC_MESSAGES=pt_BR.UTF-8   
#  [7] LC_PAPER=pt_BR.UTF-8       LC_NAME=C                 
#  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
# [11] LC_MEASUREMENT=pt_BR.UTF-8 LC_IDENTIFICATION=C       
# 
# attached base packages:
# [1] grid      stats     graphics  utils     datasets  grDevices
# [7] methods   base     
# 
# other attached packages:
# [1] KernSmooth_2.23-15 survey_3.30-3     
# 
# loaded via a namespace (and not attached):
# [1] tools_3.2.4

=====

This lack of correlation doesn't happen if the survey design object has
uniform sampling weights or if the the data is analyzed with prcomp().

Why does the returned principal component is so different from the
predicted and the reconstituted ones? Are predict() and my
"reconstitution" missing something? Are the three methods equally valid
but with different interpretations? Is there a bug in svyprcomp() ??

Thanks in advance,

Leonardo Ferreira Fontenelle
http://lattes.cnpq.br/9234772336296638



More information about the R-help mailing list