[R] generate Binomial (not Binary) data

(Ted Harding) Ted.Harding at manchester.ac.uk
Fri Feb 9 17:21:58 CET 2007

On 09-Feb-07 Duncan Murdoch wrote:
> On 2/7/2007 8:20 AM, Marc Bernard wrote:
>> Dear All,
>>   I am looking for an R function or any other reference to generate a
>>   series of correlated Binomial (not a Bernoulli) data. The "bindata"
>>   library can do this for the binary not the binomial case.
> Ted asked how you want your series correlated.  Another question is how
> you want it "binomial":  do you want each value to be binomial with 
> parameters conditional on previous values, or do you want each to be 
> marginally binomial with some fixed parameters?
> I can only think of two trivial solutions that meet both conditions 
> simultaneously:  the i.i.d. sequence, and a sequence that repeats 0 
> indefinitely.  I'd be interested in hearing if anyone knows of any 
> non-trivial examples of sequences where the distributions are both 
> conditionally and marginally binomial, or a proof that there are none.
> Duncan Murdoch

I've been thinking a bot more about this issue, and (somewhat on
the same olines as Duncan above) have explicitly identified some

Gregor Gorjanc, replying to Bernard, wrote:
>> The "bindata" library can do this for the binary not the
>> binomial case.
> But binomial is just a sum of Bernoulli vars so this should
> be doable. 

Suppose, following Grigor's suggestion, we make up correlated binomial
variables as sums of correlated Bernoulli variables.

I.e. we have 2 binomial variabes:

  X = U1 + U2 + ... + Un
  Y = V1 + V2 + ... + Vn

where {U1,...,Un} are independent bernoullis (0/1) with P["1"] = p,
and similar for {V1,...,Vn}. Then X and Y are Binomial (n,p).

Suppose that cor(U1,V1,) = r, which depends uniquely on

  P[U1=1 & V1=1] = p11

say, and similarly for (U2,V2), (U3,V3), ...

Then X and Y will be correlated, and their correlation can be
calculated in terms of p11. So this is one model that can generate
correlated binomial variables.

However, while a bivariate normal (given means and variances)
is uniquely defined by the correlation coefficient, this is not
the case for a bivariate binomial:

Let U be binomial(m,p), V binomial((n-m),p), W binomial((n-m),p)
and all three independent.


  X = U + V is binomial(n,p)
  Y = U + W is binomial(n,p)

and X,Y are correlated, with correlation coefficient (if I've
done it right) m/(m+n). So, while in general you cannot choose
m (outof n) to get an exact correlation coefficient c, this is
a different kind of model to generate correlated binomial variables.

There will be many others (some easier to think up then others).

C. The above concern only correlation between two binomial
variables (X,Y). Bernard clarified his original query in a private
email to me:

  "Let i indexes  a subject and Y_i = (Y_i1, Y_i2,...,Y_iT)
   be a vector of binomial variables for subject i  such that
   Y_it ~ Bin(n,p_t) with t = 1,2, ....T.
   A simple correlation I would like to have is :
   corr(Y_ij, Y_ik) = c for all (j,k)"

(I think the index i is irrelevant to the present query).

This could be done on the lines of case (B) above with

  Y_i1 = U + V1
  Y_i2 = U + V2
  Y_iT = U + VT

provided the desired value c is adequately approximated by m/(m+n).

Alternatively, you can start off from a situation inspired by (A):

Let W = U + V | U+V <=1, i.e. X is the sum of two independent
bernoullis conditional on excluding the case where both are 1.

Then W is bernoulli with

  P[W=0] = q^12/(q^2 + 2*p*q), P[W=1] = 2*p*q/(q^2 + 2*p*q)

where p = P[U=1] = P[V=1] and q = (1-p). Let P[W=1] = p11.

Then create binomial(n,p11) variables Y_i1, Y_i2, ... , Y_iT by

  Y_i1 = (U1 + V11) + (U2 + V12) + ... + (Un + V1n)
  Y_i2 = (U1 + V21) + (U2 + V22) + ... + (Un + V2n)
  Y_iT = (U1 + VT1) + (U2 + VT2) + ... + (Un + VTn)

where each (Ui + Vij) is conditional on U+V <= 1 as in (B)..

Then we have pairwise equi-correlated variables Y_i1,...,Y_iT.

Although the two methods described above (i.e. in (C)) give
pairwise equicorrelated binomial variables, I suspect that
their higher-order moments, and indeed the joint distribution
of (Y_ir,Y_is), will not be the same in the two methods.

So, as I see it, there is not a unique answet to the question.
[Primitive versions of the above led me to ask the question

 How do you want your series of binomial datato be "correlated"?]

Anyone wiser than I am?


E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 09-Feb-07                                       Time: 16:13:14
------------------------------ XFMail ------------------------------

More information about the R-help mailing list