[R] outlier identification: is there a redundancy-invariant substitution for mahalanobis distances?
"Jens Oehlschlägel"
joehl at gmx.de
Wed Jan 21 17:57:21 CET 2004
Dear R-experts,
Searching the help archives I found a recommendation to do multivariate
outlier identification by mahalanobis distances based on a robustly estimated
covariance matrix and compare the resulting distances to a chi^2-distribution
with p (number of your variables) degrees of freedom. I understand that
compared to euclidean distances this has the advantage of being scale-invariant.
However, it seems that such mahalanobis distances are not invariant to
redundancies: adding a highly collinear variable changes the mahalanobis distances
(see code below). Isn't also the comparision to chi^2 assuming that all
variables are independent?
Can anyone recommend a procedure to calculate distances and identify
multivariate outliers which is invariant to the degree of collinearity?
Thanks to any advice
Jens Oehlschlägel
# Example code
library(MASS)
# generate bivariate normal test data
n <- 500
x <- matrix(rnorm(n*2), ncol=2)
# scale, otherwise euclidean fails
x <- scale(x)
cr <- cov.rob(x, method="mcd")
center <- cr$center
# calculate squared euclidean and mahalanobis
d <- rowSums(t(t(x)-center)^2)
m <- as.vector(mahalanobis(x, center, cr$cov))
# euclidean an dmahalanobis basically coincide, mahalanobis slightly biased
by robust covariance underestimation
eqscplot(x=d, y=m); abline(0,1)
# Now I add a highly redundant column in hope the distances between cases
will not change
x2 <- cbind(x, x[,1]+rnorm(n, sd=0.01))
# scale, otherwise euclidean fails
x2 <- scale(x2)
cr2 <- cov.rob(x2, method="mcd")
center2 <- cr2$center
d2 <- rowSums(t(t(x2)-center2)^2)
m2 <- as.vector(mahalanobis(x2, center2, cr2$cov))
# though equally scaled, euclidean and mahalanobis diverge
eqscplot(x=d2, y=m2); abline(0,1)
# mahalanobis distances are obviously not redundancy invariant
eqscplot(x=m, y=m2); abline(0,1)
# especially if rank order of distances is considered
eqscplot(x=rank(m), y=rank(m2)); abline(0,1)
cor(m, m2)
cor(m, m2, method="spearman")
# euclidean distances look better but are also not redundancy invariant
eqscplot(x=d, y=d2); abline(0,1)
eqscplot(x=rank(d), y=rank(d2)); abline(0,1)
cor(d, d2)
cor(d, d2, method="spearman")
--
Bis 31.1.: TopMail + Digicam für nur 29 EUR http://www.gmx.net/topmail
More information about the R-help
mailing list