[R] A goodness of fit test for two discrete distributions with unequal variance?

Fri Aug 23 23:52:55 CEST 2019

I have a computer simulation in which a virtual agent ends up in different
areas of a layout depending on several factors. There are 18 conditions in
total.
If I collapse the data points into bins, where each bin is one of the areas,
the data look like this:

    x0 <- c(3,3,5,5,2) # computer simulation

Now I would like to validate this model by having human subjects go through
the same conditions, but I run into two sets of issues:

 1. The first issue is that the dataset is discrete and small (there may be
fewer than 5 counts in a bin, which is a problem for a chi-square
goodness-of-fit test), and there may also be ties. After some online
digging I found two options:
- a permutation test
- a Cramer-von Mises goodness-of-fit test (see this paper:
  https://journal.r-project.org/archive/2011/RJ-2011-016/RJ-2011-016.pdf)
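
For the first option, a minimal sketch of what a permutation-style test could
look like in base R, assuming the two sets of bin counts are arranged as a
2 x 5 table. The names `subj`, `counts` and `res` are made up for
illustration, and the subject counts are invented. `chisq.test()` with
`simulate.p.value = TRUE` draws random tables with the observed margins
instead of relying on the chi-square approximation, so it does not need 5 or
more counts per bin:

```r
# Counts per area; the subject counts are invented for illustration.
x0   <- c(3, 3, 5, 5, 2)  # computer simulation
subj <- c(4, 2, 5, 4, 3)  # one hypothetical human subject

# 2 x 5 contingency table: rows are sources, columns are areas.
counts <- rbind(simulation = x0, subject = subj)

set.seed(1)  # make the Monte Carlo p-value reproducible
res <- chisq.test(counts, simulate.p.value = TRUE, B = 10000)

res$statistic  # Pearson chi-square statistic of the observed table
res$p.value    # share of simulated tables at least as extreme
```

This tests whether the two rows share the same distribution over areas; it
does not use the ordering of the bins.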

I thought the Cramer-von Mises goodness-of-fit test could work, so I ran it
with made-up data for *one human subject* and got the following result:

    x0 <- c(3,3,5,5,2) # computer simulation
    x1 <- c(4,2,5,4,3) # subject 1

    library(goftest)

    cvm.test(x0, ecdf(x1))

    >Cramer-von Mises test of goodness-of-fit
    >Null hypothesis: distribution ‘ecdf(x1)’

    >data:  x0
    >omega2 = 0.14667, p-value = 0.4106

So far so good. But now suppose I have more than one human subject, say
four of them. These are the results from the additional subjects:

    x2 <- c(3,3,5,2,5) # subject 2
    x3 <- c(2,2,5,6,3) # subject 3
    x4 <- c(3,2,5,6,2) # subject 4

Now I run into the second set of issues:

 2. On the one hand I have a single computer simulation; on the other I
have data from four subjects. Should I take the mean of the results for the
human subjects? Would my data then still be “discrete”? Or should I run my
simulation four times? But I would always get the same results, so the
variance of the two datasets would be different.

Any ideas? Maybe I should change the design and add more levels to my
factors, so that I have more trials and the bin counts get larger?
