[BioC] End of the line of GOstats: making sense of the hypergeometric test results now

Fri Nov 27 21:08:53 CET 2009

Roughly speaking, you can tell by the name. 'Cell communication' is a relatively general term, whereas 'positive regulation of cell motion' is a more specific one. If you are not sure, you can look up the term and its place in the ontology on the AmiGO website (amigo.geneontology.org).

Or you could be really cool and use R to draw the DAG for the term you are interested in.
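
Short of drawing the DAG, the Size column of the report quoted further down gives a rough numeric hint: a term annotated to few genes is usually (not always) a more specific one. The following is only a sketch of that idea, written in plain Python rather than with GOstats. The six rows are copied from the quoted report; the function names and the size cut-offs of 10 and 500 are assumptions made up for illustration, not recommended thresholds.

```python
# Rows copied from the GOstats report quoted later in this thread:
# (GOBPID, Pvalue, OddsRatio, Count, Size, Term)
REPORT = [
    ("GO:0040011", 9.322848e-05, 2.558205, 26, 145, "locomotion"),
    ("GO:0002376", 2.337660e-04, 1.887324, 47, 344, "immune system process"),
    ("GO:0007165", 2.821193e-04, 1.541496, 110, 1005, "signal transduction"),
    ("GO:0006954", 2.840421e-04, 2.892962, 18, 90, "inflammatory response"),
    ("GO:0051272", 4.985200e-04, 6.638731, 7, 19,
     "positive regulation of cell motion"),
    ("GO:0007154", 5.866973e-04, 1.493138, 115, 1079, "cell communication"),
]

# Hypothetical cut-offs, chosen only for illustration.
MIN_SIZE = 10
MAX_SIZE = 500


def within_size(rows, lo=MIN_SIZE, hi=MAX_SIZE):
    """Keep terms whose Size lies in [lo, hi], preserving the input order."""
    return [row for row in rows if lo <= row[4] <= hi]


def rank_by_odds_ratio(rows):
    """Return the rows sorted by odds ratio, largest effect first."""
    return sorted(rows, key=lambda row: row[2], reverse=True)


if __name__ == "__main__":
    # Drop the very broad (or very small) terms, then rank by effect size.
    for goid, pvalue, odds, count, size, term in rank_by_odds_ratio(
        within_size(REPORT)
    ):
        print(f"{goid}  OR={odds:.2f}  p={pvalue:.2e}  size={size:4d}  {term}")
```

With these cut-offs the two largest nodes (signal transduction and cell communication) drop out, and positive regulation of cell motion comes first. Size is only a proxy, though; the position of the term in the DAG is what actually tells you how general it is.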

Best,

Jim

>>> Massimo Pinto <pintarello at gmail.com> 11/27/09 9:13 AM >>>
Thank you to both.

@Robert: I have run hyperGTest with both conditional=TRUE and conditional=FALSE
and noticed that the effect is rather substantial: the list of significantly
implicated nodes gets shorter when conditional=TRUE.

@James: how do you tell a more general node from a less general one? Do you
merely count the genes in each (the Size column), or do you look at other
factors, for example the node's position in the ontology tree?

Thank you
Massimo

Massimo Pinto
Post Doctoral Research Fellow
Enrico Fermi Centre and Italian Public Health Research Institute (ISS), Rome
http://claimid.com/massimopinto

On Wed, Nov 25, 2009 at 3:57 PM, Robert Gentleman <rgentlem at fhcrc.org> wrote:

> Hi,
>  two comments:
> 1) how you interpret the output depends a bit on whether you used
> conditional=TRUE or FALSE (I don't think you have told us). And which you
> use depends on what you are trying to achieve.
>
> 2) the odds ratio is the size of the effect (if you are more comfortable
> with gene expression data then think "fold change") and the p-value (as
> always) tells you how unusual that is under the null hypothesis. You should
> rank your list by whichever is most important to you.
>
>  Robert
>
>
> James W. MacDonald wrote:
>
>> Hi Massimo,
>>
>> Massimo Pinto wrote:
>>
>>> Greetings all,
>>>
>>> Having first searched the GMane archives, I suppose the following
>>> question is appropriate. After selecting my 'entrezUniverse', I have
>>> run a hypergeometric test, as implemented in functions provided in
>>> GOstats, and thus obtained a readable, hyperlinked report containing a
>>> list of the ontology nodes that appear to have been significantly
>>> implicated, along with p values, odds ratio, number of significantly
>>> regulated genes that fall in each listed node, etc.
>>>
>>> The report is not exactly short, and I am looking for criteria to
>>> proceed with the interpretation of the results. Specifically, I am
>>> trying to hunt for the most 'interesting' implicated ontology nodes
>>> and, to this end, a marker may be useful. Assuming this line of
>>> thinking is appropriate and focusing on the first few lines of the
>>> report:
>>>
>>>  GO.df.CM3.ctr1.2.3
>>>
>>>       GOBPID       Pvalue OddsRatio   ExpCount Count Size Term
>>> 1 GO:0040011 9.322848e-05  2.558205 11.8928490    26  145 locomotion
>>> 2 GO:0002376 2.337660e-04  1.887324 28.2147590    47  344 immune system process
>>> 3 GO:0007165 2.821193e-04  1.541496 82.4297464   110 1005 signal transduction
>>> 4 GO:0006954 2.840421e-04  2.892962  7.3817683    18   90 inflammatory response
>>> 5 GO:0051272 4.985200e-04  6.638731  1.5583733     7   19 positive regulation of cell motion
>>> 6 GO:0007154 5.866973e-04  1.493138 88.4992004   115 1079 cell communication
>>>  [...]
>>>
>>> I do wonder whether the correct marker for my hunt is the p value, or
>>> the Odds Ratio, which would rank my list differently. Plus, the
>>> ontology nodes containing the largest number of genes (Size, above)
>>> may be of too broad scope to reveal the presence of a biological
>>> process that is specifically implicated in my experiment. By the same
>>> token, ontology nodes with too few genes may not provide convincing
>>> evidence of their implication.
>>>
>>> In short, what is the suggested strategy for proceeding?
>>>
>>
>> The strategy depends on your original hypothesis. If the hypothesis was
>> that inflammation should be a factor in your experimental samples, then you
>> should be looking at #4.
>>
>> If there wasn't a hypothesis, then I would tend to look at the more
>> specific terms first. Something like locomotion is so general as to be
>> useless. However, positive regulation of cell motion would probably be a
>> more tractable term to explore.
>>
>> Best,
>>
>> Jim
>>
>>
>>
>>> Thank you very much in advance to all of you who will read this post.
>>>
>>> Yours
>>> Massimo
>>>
>>
>>
