[R] (OT) Does Pearson correlation assume bivariate normality of the data?
Thomas Lumley
tlumley at u.washington.edu
Tue May 26 23:01:40 CEST 2009
This is the sort of problem (another related one is the assumptions of the
t-test) that attracts a lot of relatively inefficient argument.
Some basic points:
1. If random variables X and Y are uncorrelated (and have finite moments,
but that's a purely technical issue), the distribution of the Pearson
correlation coefficient in samples from X and Y will be approximately
Normal with mean zero in large samples. No further distributional
assumption is needed.
So, the test is valid in sufficiently large samples.
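Here is a quick sketch of point 1 in R (the distributions are made up
purely for illustration): simulate independent, clearly non-Normal X and
Y, and look at the null distribution of the sample correlation.

  ## Null distribution of r for independent, non-Normal X and Y:
  ## approximately Normal with mean zero in large samples.
  set.seed(1)
  n <- 200
  r <- replicate(5000, cor(rexp(n), rchisq(n, df = 1)))
  mean(r)                # close to 0
  sd(r)                  # close to 1/sqrt(n), roughly 0.07
  qqnorm(r); qqline(r)   # the QQ plot is close to a straight line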
2. Similarly, the sample correlation coefficient between two random
variables X and Y is a consistent estimator of the correlation between X
and Y. Here the distribution [needed for confidence intervals] does
depend on the distributions of X and Y, but by less than you might expect.
For example, I found that Fisher's z-transformation with a t-distribution
on n-3 df gives a pretty good approximation to the distribution of the
correlation between lognormal random variables (a model for air pollution
data) with a sample size of 10.
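A sketch of that kind of approximation in R (the helper function below
is mine, not from any package, and the lognormal data are simulated just
to mirror the air-pollution example):

  ## Fisher z confidence interval for r, using a t reference
  ## distribution on n-3 df, back-transformed to the r scale.
  fisher.ci <- function(x, y, level = 0.95) {
    n <- length(x)
    z <- atanh(cor(x, y))             # Fisher's z-transform
    half <- qt(1 - (1 - level)/2, df = n - 3) / sqrt(n - 3)
    tanh(z + c(-1, 1) * half)         # back to the correlation scale
  }
  set.seed(2)
  x <- rlnorm(10); y <- rlnorm(10)    # n = 10, as in the example
  fisher.ci(x, y)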
3. If X and Y are bivariate Normal and uncorrelated, they must be
independent, so the null hypothesis of zero correlation is especially
interesting for Normal data.
4. Zero correlation may still be an interesting null hypothesis without
bivariate Normality -- if you don't know much about X and Y it may be an
advance to be able to establish that Y tends to be higher when X is
higher.
5. The correlation coefficient is sensitive to outlying observations. This
is not necessarily a bad thing, but it means that if X and Y both have
long-tailed distributions the test for zero correlation will be sensitive
primarily to the tails.
6. If the tails of the distribution are mostly gross-error contamination,
the sensitivity to the tails is bad.
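A quick made-up illustration of points 5 and 6: a single gross-error
point can dominate r.

  ## One contaminated point can drive the correlation.
  set.seed(3)
  x <- rnorm(50); y <- rnorm(50)   # independent, so r is near 0
  cor(x, y)
  x[1] <- 10; y[1] <- 10           # a single gross error
  cor(x, y)                        # now substantially positive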
7. The various robust or rank-based correlations don't estimate the same
thing, any more than the mean and median estimate the same thing. They
need not even have the same sign. Some of them are
intended for bivariate Normal data with gross-error contamination, which
is fine if that is what you have. Kendall's tau at least has a sensible
interpretation that doesn't depend on distributions, whereas it's not
clear to me why the hypothesis of zero Spearman correlation would be
interesting without distributional assumptions.
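A small simulated illustration of point 7: the three coefficients are
estimating different population quantities, so they need not agree.

  ## Pearson, Spearman and Kendall estimate different things,
  ## and on skewed data they give visibly different values.
  set.seed(4)
  x <- rlnorm(100)
  y <- x + rlnorm(100)
  cor(x, y, method = "pearson")
  cor(x, y, method = "spearman")
  cor(x, y, method = "kendall")  # tau: P(concordant) - P(discordant)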
8. Permutation tests will give you an exact small-sample test of
*independence*, not of zero correlation. The test is not exact (it may be
conservative or anticonservative) if X and Y are dependent but
uncorrelated. The test has power only against alternatives where the
correlation is non-zero.
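A sketch of such a permutation test, written from scratch here rather
than taken from a package; the dependent-but-uncorrelated example uses
Y = X^2 with symmetric X.

  ## Permutation test with |r| as the statistic: permuting y
  ## generates the distribution of r under *independence*.
  perm.cor.test <- function(x, y, B = 9999) {
    r.obs <- cor(x, y)
    r.perm <- replicate(B, cor(x, sample(y)))
    mean(c(abs(r.perm) >= abs(r.obs), TRUE))  # (count + 1)/(B + 1)
  }
  ## Dependent but uncorrelated: the test has little power here.
  set.seed(5)
  x <- rnorm(50); y <- x^2
  cor(x, y)              # near zero
  perm.cor.test(x, y)    # typically non-significant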
Some of the issues behind the confusion are the same as for the t-test:
- a confusion of necessary vs sufficient assumptions
- a confusion of long-tailed distributions and gross error contamination
- worrying about the meaning of the null hypothesis only for 'parametric'
tests and not for 'non-parametric' tests
- not understanding that permutation tests have assumptions.
There is also some genuine and informed disagreement about the relative
importance of potential problems. Some of this disagreement is about
philosophical issues, and some is about the likely practical impact, which
depends a lot on the setting.
-thomas
On Tue, 26 May 2009, Liviu Andronic wrote:
> Dear all,
> The other day I was reading this post [1] that slightly surprised me:
> "To reject the null of no correlation, an hypothsis test based on the
> normal distribution. If normality is not the base assumption your
> working from then p-values, significance tests and conf. intervals
> dont mean much (the value of the coefficient is not reliable) " (BOB
> SAMOHYL).
>
> To me this implied that in practice Pearson's product-moment
> correlation (and associated significance) is often used incorrectly.
> Then I went wrestling with the literature, and with my friends, over
> what the Pearson correlation actually imposes, and after about a week
> I'm still head-banging against divergent opinions. From what I
> understand there are two aspects to this classical parametric
> procedure:
> 1. Estimating the magnitude of the correlation:
> - the sample data should come from a bivariate normal distribution
> (?cor, ?cor.test, Dalgaard 2003, somewhat implied in many examples
> such as ?rrcov::maryo or Wilcox 2005)
> - the sample data should be (I presume univariate) normal (Crawley
> 2007)
> - the sample data can be of any distribution (if I understand
> correctly the `distribution-free' definition of correlation in Huber
> 1981, 2004)
> - the sample data could come from just about any bivariate
> distribution (Wikipedia [2][3] and associated reference)
> - the coefficient is very much not robust to univariate outliers (e.g.,
> Huber 1981), or to multivariate outliers (?rrcov::maryo with data
> from Maronna and Yohai 1998)
>
> 2. Assessing whether the correlation is significantly different from
> zero (using a statistic following the t distribution):
> - the data should come from independent normal distributions (?cor.test)
> - at least one of the marginal distributions is normal (Wilcox 2005)
>
> Surprisingly (to me), many sources seem quite evasive about clearly
> defining the Pearson correlation. Reading the literature I was pretty
> much convinced that the correlation coefficient is not robust to
> outliers. The literature is also convincing on the impact of
> contaminated normal, heavy-tailed distributions on parametric tests
> (invalidating their results). However, I'm not clear on the
> distributional assumptions on the data:
> - does the data have to be bivariate normal in order to correctly
> estimate the linear correlation?
> - does the data have to be univariate normal in order to correctly
> estimate the significance of the correlation?
>
> If the above is true, what are the preferable alternatives for
> non-gaussian data (including heavy-tailed normal)? non-parametric
> tests (spearman, kendall)? the robust MASS::cov.mcd, rrcov::CovOgk,
> robust::covRob()? hypothesis testing via Permutation Tests [4]? is
> there a robust cor.test? other robust tests of independence?
>
> Thank you,
> Liviu
>
> [1] http://www.nabble.com/Correlation-on-Tick-Data-tp18589474p18595197.html
> [2] http://en.wikipedia.org/wiki/Correlation#Sensitivity_to_the_data_distribution
> [3] http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient#Sensitivity_to_the_data_distribution
> [4] http://www.burns-stat.com/pages/Tutor/bootstrap_resampling.html#permtest
Thomas Lumley
Assoc. Professor, Biostatistics
University of Washington, Seattle
tlumley at u.washington.edu