[R] A goodness of fit test for two discrete distributions with unequal variance?
Serena De Stefani
@eren@de@te|@n| @end|ng |rom gm@||@com
Fri Aug 23 23:52:55 CEST 2019
I have a computer simulation in which a virtual agent ends up in different
areas of a layout based on several factors. There are 18 conditions in
total.
If I collapse the data points into bins, where each bin is one of the areas,
the data would look like this:
x0 <- c(3,3,5,5,2) # computer simulation
Now I would like to validate this model by having human subjects go through
the same conditions, but I run into two sets of issues:
1. The first issue is that the dataset is discrete and small: there may be
fewer than 5 counts in a bin, which is a problem for a chi-square
goodness-of-fit test, and there may also be ties. After some online
digging I found two options:
- a permutation test
- a Cramer-von Mises test of goodness-of-fit (see this paper:
https://journal.r-project.org/archive/2011/RJ-2011-016/RJ-2011-016.pdf)
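As a rough illustration of the first option, here is a minimal sketch of a permutation test in base R. It assumes the vectors hold counts per area, that each trial can be treated as an independent observation, and that a Pearson chi-square statistic is an acceptable measure of discrepancy; the helper name chisq_stat and the choice of B = 5000 permutations are arbitrary and only for illustration.

```r
x0 <- c(3, 3, 5, 5, 2)  # computer simulation (counts per area)
x1 <- c(4, 2, 5, 4, 3)  # made-up counts for one human subject

# Expand the bin counts into one area label per trial
area   <- c(rep(seq_along(x0), x0), rep(seq_along(x1), x1))
source <- rep(c("sim", "human"), c(sum(x0), sum(x1)))

# Pearson chi-square statistic for a source-by-area table
# (hypothetical helper, not from any package)
chisq_stat <- function(tab) {
  expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
  sum((tab - expected)^2 / expected)
}

observed <- chisq_stat(table(source, area))

# Shuffle the source labels to build the null distribution
set.seed(1)
B <- 5000
perm <- replicate(B, chisq_stat(table(sample(source), area)))

# Small tolerance guards against floating-point ties
p_value <- (1 + sum(perm >= observed - 1e-8)) / (B + 1)
p_value
```

Because the p-value comes from reshuffling the observed trials rather than from a large-sample approximation, small bin counts and ties are not a problem in themselves, although with so few trials the test will have little power.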
I thought the Cramer-von Mises goodness-of-fit test could work, so
I ran it with made-up data for *one human subject* and I get the following
result:
x0 <- c(3,3,5,5,2) # computer simulation
x1 <- c(4,2,5,4,3) # subject 1
library(goftest)
cvm.test(x0, ecdf(x1))
>Cramer-von Mises test of goodness-of-fit
>Null hypothesis: distribution ‘ecdf(x1)’
>data: x0
>omega2 = 0.14667, p-value = 0.4106
So far so good. But now let’s say I would like to have more than one human
subject, let’s say four of them. These are the results from the additional
subjects:
x2 <- c(3,3,5,2,5) # subject 2
x3 <- c(2,2,5,6,3) # subject 3
x4 <- c(3,2,5,6,2) # subject 4
Now I run into the second set of issues:
2. On one side I have a single computer simulation; on the other I have
data from four subjects. Should I take the mean of the results for the
human subjects? Would my data then still be “discrete”? Or should I run my
simulation four times? But I would always get the same results, so the
variance of the two datasets would be different.
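To make the two choices concrete, here is a hedged sketch in base R. It assumes the subjects can be pooled by summing their counts (which keeps the data discrete, unlike the mean) and that the deterministic simulation can be treated as a fixed reference distribution rather than as a second sample. Neither assumption is established here, and the Monte Carlo chi-square test is only one possible choice.

```r
x0 <- c(3, 3, 5, 5, 2)  # computer simulation (counts per area)
x1 <- c(4, 2, 5, 4, 3)  # subject 1
x2 <- c(3, 3, 5, 2, 5)  # subject 2
x3 <- c(2, 2, 5, 6, 3)  # subject 3
x4 <- c(3, 2, 5, 6, 2)  # subject 4

human <- rbind(x1, x2, x3, x4)

# Averaging gives non-integer values, so the data are no longer counts
colMeans(human)

# Summing keeps integer counts per area
pooled <- colSums(human)

# Treat the simulation as the hypothesized proportions and compare the
# pooled human counts against it; simulate.p.value = TRUE replaces the
# chi-square approximation with a Monte Carlo p-value
set.seed(1)
fit <- chisq.test(x = pooled, p = x0 / sum(x0),
                  simulate.p.value = TRUE, B = 10000)
fit
```

Pooling ignores differences between subjects, so it would be worth checking separately whether the four subjects look similar to each other before relying on the pooled counts.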
Any ideas? Maybe I should change the design and have more levels for my
factors, so that I have more trials and the bins get bigger?