[R] Multivariate hypergeometric distribution version of phyper()

Peter Ehlers ehlers at ucalgary.ca
Tue Mar 30 19:04:20 CEST 2010


Karl,

I strongly support Chuck's recommendations.
If you do still want to compute such probabilities 'by hand',
you could consider the lchoose() function which does work
for your example.

  -Peter Ehlers

On 2010-03-30 9:55, Charles C. Berry wrote:
> On Tue, 30 Mar 2010, Karl Brand wrote:
>
>> Dear R Users,
>>
>> I employed the phyper() function to estimate the likelihood that the
>> number of genes overlapping between 2 different lists of genes is due
>> to chance. This appears to work appropriately.
>>
>> Now i want to try this with 3 lists of genes which phyper() does not
>> appear to support.
>>
>> Some googling suggests i can utilize the Multivariate hypergeometric
>> distribution to achieve this. eg.:
>>
>> http://en.wikipedia.org/wiki/Hypergeometric_distribution
>>
>> But when i try to do this manually using the choose() function (see
>> attempt below example with just two gene lists) i'm unable to perform
>> the calculations- the numbers hit infinity before getting an answer.
>>
>> Searching cran archives for "Multivariate hypergeometric" show this
>> term in the vignettes of package's ‘combinat’ and ‘forward’. But i'm
>> unable to make sense of the these pachakege functions in the context
>> of my aforementioned apllication.
>>
>> Can some one suggest a function, script or method to achieve my goal
>> of estimating the likelyhood of overlap between 3 lists of genes,
>> ideally using the multivariate hypergeometric, or anything else for
>> that matter?
>
>
> Two suggestions:
>
> 1) Don't! Likely the theory is unsuited for the application. In
> most applications that generate lists of genes, the genes are
> not iid realizations and the hypergeometric gives results that
> are astonishingly anticonservative. As an alternative , the
> block bootstrap may be suitable. See
> http://171.66.122.45/cgi/content/abstract/17/6/760
>
> and Google (scholar) 'genomic "block bootstrap"' for some
> starting points.
>
>
> 2) Take this thread to the bioconductor list. You are much
> more likely to get pointers to useful packages and functions
> for genomic statistical software there.
>
> HTH,
>
> Chuck
>
>
>>
>> cheers in advance,
>>
>> Karl
>>
>>
>>
>> #example attempt with two gene lists m & n
>> N <- 45101 # total number balls in urn
>> m <- 720 # number of 'white' or 'special' balls in urn, aka 'success'
>> n <- 801 # number balls drawn or number of samples
>> k <- 40 # number of 'white' or 'special' balls DRAWN
>>
>> a <- choose(m,k)
>> b <- choose((N-m),(n-k))
>> z <- choose(N,n)
>> prK <- (a*b)/z #'the answer'
>> print(prK)
>> [1] NaN
>>
>>> a
>> [1] 7.985852e+65
>>> b
>> [1] Inf
>>> z
>> [1] Inf
>>
>>
>> --
>> Karl Brand
>> Department of Genetics
>> Erasmus MC
>> Dr Molewaterplein 50
>> 3015 GE Rotterdam
>> T +31 (0)10 704 3457 | F +31 (0)10 704 4743 | M +31 (0)642 777 268
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
> Charles C. Berry (858) 534-2098
> Dept of Family/Preventive Medicine
> E mailto:cberry at tajo.ucsd.edu UC San Diego
> http://famprevmed.ucsd.edu/faculty/cberry/ La Jolla, San Diego 92093-0901
>
>
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

-- 
Peter Ehlers
University of Calgary



More information about the R-help mailing list