[R] Question about Kolmogorov-Smirnov test behavior
peter dalgaard
pdalgd at gmail.com
Thu Jan 7 15:29:13 CET 2016
On 07 Jan 2016, at 14:09 , Shea Lutton <shea at eagleseven.com> wrote:
> Dear R-Help,
> I am trying to understand the output of the KS test on a pair of files. I am trying to determine if the CDF of one distribution is less than (to the left of) the CDF of a second distribution. My problem is that regardless of whether I run A against B, or B against A, the KS output seems to indicate significance that A is less than B AND B is less than A. Can anybody help me understand where my mistake is or if I am misinterpreting the results?
>
>
> Here is my code:
>
> file_a = readLines("./file_a.txt")
> file_b = readLines("./file_b.txt")
> a <- as.numeric(file_a)
> b <- as.numeric(file_b)
> ks.test(b, a, alternative = "less")
> ks.test(a, b, alternative = "less")
>
>
> And here is the output:
>
> Two-sample Kolmogorov-Smirnov test
>
> data: b and a
> D^- = 0.087769, p-value < 2.2e-16
> alternative hypothesis: the CDF of x lies below that of y
>
> Two-sample Kolmogorov-Smirnov test
>
> data: a and b
> D^- = 0.085083, p-value < 2.2e-16
> alternative hypothesis: the CDF of x lies below that of y
>
>> plot(ecdf(a), col = "blue")
>> plot(ecdf(b), add = TRUE, col = "red", lty = 1, pch = 26)
>> plot(density(a))
>> lines(density(b), col = "red")
>
>
> My data files can be found here, they are simple columns of numbers.
> file_a.txt : http://pastebin.com/e3bmnEDt
> file_b.txt : http://pastebin.com/5VBzHRXZ
>
This effect can be generated quite easily by simulation:
> a <- rnorm(1000) ; b <- rnorm(1000, sd=10)
> ks.test(a, b, alternative="less")
Two-sample Kolmogorov-Smirnov test
data: a and b
D^- = 0.394, p-value < 2.2e-16
alternative hypothesis: the CDF of x lies below that of y
> ks.test(b, a, alternative="less")
Two-sample Kolmogorov-Smirnov test
data: b and a
D^- = 0.412, p-value < 2.2e-16
alternative hypothesis: the CDF of x lies below that of y
The cause should be quite apparent if you do
plot(ecdf(b))
plot(ecdf(a), add=TRUE)
and
plot(function(x)ecdf(a)(x)-ecdf(b)(x), from=-10, to=10)
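The basic point is that KS looks at a maximum difference, so two CDFs may deviate in both the positive and the negative direction at the same time. Each one-sided statistic only measures the largest gap in its own direction, and both gaps can be large when the two curves cross.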
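
To make that concrete, here is a minimal sketch that computes the two one-sided gaps by hand from the empirical CDFs, reusing the simulated a and b from above. The seed and the names Fa, Fb, pooled, d, D_plus and D_minus are my own choices for illustration, and this is only meant to mirror what ks.test() reports for continuous data, not to reproduce its internals.

```r
set.seed(1)                      # arbitrary seed, only for reproducibility
a <- rnorm(1000)
b <- rnorm(1000, sd = 10)

Fa <- ecdf(a)                    # empirical CDF of a, as a step function
Fb <- ecdf(b)                    # empirical CDF of b

# The ECDFs only jump at observed values, so the extreme gaps
# between them are attained on the pooled sample.
pooled <- sort(c(a, b))
d <- Fa(pooled) - Fb(pooled)     # signed difference, CDF of a minus CDF of b

D_plus  <- max(d)                # largest amount by which a's CDF lies above b's
D_minus <- max(-d)               # largest amount by which a's CDF lies below b's

c(D_plus = D_plus, D_minus = D_minus)

# D_minus should agree with the statistic from the "less" test,
# and D_plus with the one from the "greater" test.
ks.test(a, b, alternative = "less")$statistic
ks.test(a, b, alternative = "greater")$statistic
```

Because a and b have the same centre but very different spreads, the CDF of a lies well below that of b to the left of zero and well above it to the right, so D_plus and D_minus are both far from zero and both one-sided tests reject.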
--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd.mes at cbs.dk Priv: PDalgd at gmail.com