[BioC] GOstats::geneIdUniverse() and its relation to organism annotation db's

Steve Lianoglou mailinglist.honeypot at gmail.com
Tue Sep 7 20:15:54 CEST 2010


Hi Robert,

In my hunt to find "the bug" I didn't stumble upon the difference
between org.Sc.sgdGO and org.Sc.sgdGO2ALLORFS (I don't think I
realized the latter one even existed).

I'm guessing the problem I thought I saw was due to my oversight and
is exactly as you say. Thanks for the thorough explanation and
references.

Cheers,
-steve

On Tue, Sep 7, 2010 at 9:16 AM, Robert M. Flight <rflight79 at gmail.com> wrote:
> Hi Steve,
>
> I wonder if you are running into the problem that generally when you
> query for the GO terms you get only those GO terms that the ORF is
> directly annotated with (this would correspond to the actual
> annotations on the GO website, for instance), however when calculating
> enrichment, not only the directly annotated terms but also the parent
> terms (less specific) are included.
>
> I'm thinking of the difference between "org.Sc.sgdGO", which returns
> "org.Sc.sgdGO is an R object that provides mappings between ORF
> identifiers and the GO identifiers that they are directly associated
> with. This mapping and its reverse mapping do NOT associate the child
> terms from the GO ontology with the gene. Only the directly evidenced
> terms are represented here.", and a mapping that returns also the
> other GO terms in the directed acyclic graph (up the tree
> essentially). This would be the reverse of "org.Sc.sgdGO2ALLORFS".
>
> For example, any gene directly annotated to "GO:0007155 : cell
> adhesion", will also by virtue of the DAG be indirectly annotated with
> "GO:0022610 : biological adhesion", whether or not it is directly
> annotated with that or not. When GO enrichment using GOHyperGTest is
> calculated, all of the indirectly annotated terms are considered as
> well. (see this link:
> http://amigo.geneontology.org/cgi-bin/amigo/browse.cgi?action=plus_node&target=GO:0022610&open_1=GO:0008150,all,amigo1283864733)
>
> This seems to be the cause of your disparity below. You might want to
> read a bit more about the GO and how enrichment calculations are done.
> A good paper describing the nuts and bolts of most packages is Gavin
> Sherlocks paper on GO::TermFinder
> (http://www.ncbi.nlm.nih.gov/pubmed/15297299), which is the basis for
> most other enrichment tools, including GOStats.
>
> If I do:
>
> crud <- mget("GO:0007059", envir=org.Sc.sgdGO2ALLORFS)
> crud2 <- unlist(crud)
> grep("YKL049C", crud2)
>
> I get 135 as the result.
>
> Cheers,
>
> -Robert
>
> Robert M. Flight, Ph.D.
> Bioinformatics and Biomedical Computing Laboratory
> University of Louisville
> Louisville, KY
>
> PH 502-852-0467
> EM robert.flight at louisville.edu
> EM rflight79 at gmail.com
>
> Williams and Holland's Law:
>        If enough data is collected, anything may be proven by
> statistical methods.
>
>
>
> On Mon, Sep 6, 2010 at 05:21, Steve Lianoglou
> <mailinglist.honeypot at gmail.com> wrote:
>> Hi,
>>
>> I've been trying to do some non-run-of-the-mill gene ontology analysis
>> and have been messing around with the innards of the results we get
>> from GOstats::hyperGTest.
>>
>> As a result, I've found some peculiarities in the GO annotations I'm
>> retrieving from the hyperGTest and how they relate to the GO
>> annotations in my organism's (saccharomyces cerevisiae) annotation
>> database.
>>
>> In particular: for some enriched GO term X, I'm finding genes listed
>> in the geneIdUniverse of X that do not have X as a member of their
>> organism.db.GO map, and I'm not sure how that could be. I'm even
>> setting condition=FALSE to my hyperGTest to make it as "vanilla" as
>> possible (but as far as I understand the conditional test, this
>> wouldn't effect what I'm seeing anyway since genes are removed from GO
>> terms, and not added).
>>
>> I have a specific example below. You'll find that for the first
>> significant GOID found (GO:0007059), gene "YKL049C" is listed as a
>> member of this GOID's geneIdUniverse as it is returned from the
>> hyperGTest, but GO:0007059 isn't listed in my annotation db GO mapping
>> (org.Sc.sgdGO) for this gene, nor is "YKL049C" found in its GO2ORF
>> map.
>>
>> === Example ===
>>
>> library(GOstats)
>> library(org.Sc.sgd.db)
>> library(annotate)
>>
>> orfs <- c(
>>  "YGR084C", "YDR409W",   "YFR017C",   "YDR342C",   "YBR152W",
>>  "YIL014W", "YGR134W",   "YDL109C",   "YDL234C",   "YDR524C",
>>  "YBR002C", "YGR063C",   "YDR034C-A", "YIR011C",
>>  "YHR175W", "YHR109W",   "YGR207C",   "YCR023C",   "YER039C-A",
>>  "YGL064C", "YGR249W",   "YHR172W",   "YGL226W",   "YAL064W",
>>  "YDR309C", "YDR473C",   "YAL058W",   "YHL030W",   "YKL101W",
>>  "YJL122W", "YBR121C",   "YIL018W",   "YDR443C",   "YBR069C",
>>  "YBR011C", "YHR009C",   "YGL166W",   "YIL103W",   "YDR206W",
>>  "YBL031W", "YBR162W-A", "YFL018C",   "YBR214W",   "YIL076W",
>>  "YBR291C", "YKL049C",   "YIL149C",   "YDR034W-B", "YDR395W")
>>
>> orf2go <- getAnnMap('GO', 'org.Sc.sgd')
>> go2orf <- getAnnMap('GO2ORF', 'org.Sc.sgd')
>> all.orfs <- keys(orf2go)
>>
>> ## Do GOstats test to get significant GO ontologies
>> params <- new("GOHyperGParams", geneIds=orfs,
>>              universeGeneIds=all.orfs, annotation='org.Sc.sgd',
>>              ontology='BP', pvalueCutoff=0.05,
>>              conditional=FALSE, testDirection='over')
>> GO <- hyperGTest(params)
>>
>> ## Look at the genes annotated in the universe for each GOID and compare
>> ## with annotations for each gene in org.Sc.sgdGO
>> universes <- geneIdUniverse(GO)
>> uid <- names(universes)[1] ## GO:0007059
>> uid.members <- universes[[uid]]
>> "YKL049C" %in% uid.members ## TRUE
>>
>> ## Is the gene listed in GOID's (uid) universe annotated with that GOID?
>> ## Given previous result, this *should* be TRUE
>> go.ids <- orf2go[["YKL049C"]]
>> uid %in% names(go.ids) ## FALSE
>>
>>
>> ## Does that GOID have this gene as its member?
>> ## This also should be TRUE
>> check.orfs <- go2orf[[uid]]
>> "YKL049C" %in% check.orfs ## FALSE
>>
>> ======================================
>>
>> Since `"YKL049C" %in% uid.members` is TRUE, I would expect the go term
>> `uid` (GO:0007059) to be in both `names(go.ids)` and `check.orfs`, but
>> it's not.
>>
>> Can someone shed some light on this for me?
>>
>> sessionInfo()
>> R version 2.11.1 (2010-05-31)
>> i386-apple-darwin9.8.0
>>
>> locale:
>> [1] C
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>
>> other attached packages:
>>  [1] GO.db_2.4.1          annotate_1.26.1      org.Sc.sgd.db_2.4.1
>> GOstats_2.14.0
>>  [5] RSQLite_0.9-2        DBI_0.2-5            graph_1.26.0
>> Category_2.14.0
>>  [9] AnnotationDbi_1.10.2 Biobase_2.8.0        devtools_0.1
>> stringr_0.4
>> [13] roxygen_0.1-2        profr_0.1.1          digest_0.4.2
>> testthat_0.3
>>
>> loaded via a namespace (and not attached):
>>  [1] GSEABase_1.10.0   RBGL_1.24.0       XML_3.1-1
>> evaluate_0.3      genefilter_1.30.0
>>  [6] plyr_1.1          splines_2.11.1    survival_2.35-8
>> tools_2.11.1      xtable_1.5-6
>>
>>
>> Thanks,
>> -steve
>>
>>
>> --
>> Steve Lianoglou
>> Graduate Student: Computational Systems Biology
>>  | Memorial Sloan-Kettering Cancer Center
>>  | Weill Medical College of Cornell University
>> Contact Info: http://cbio.mskcc.org/~lianos/contact
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>



-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact



More information about the Bioconductor mailing list