[BioC] GO.db: how to get GO Term
Waclaw.Marcin.Kusnierczyk at idi.ntnu.no
Waclaw.Marcin.Kusnierczyk at idi.ntnu.no
Wed Jun 24 10:19:49 CEST 2009
Interesting, thanks.
vQ
> Hi Gordon, Martin, Wacek,
>
> Martin Morgan wrote:
>> Wacek Kusnierczyk wrote:
>>> Marc Carlson wrote:
>>>> One thing you can do to make this more efficient is to use mget
>>>> instead
>>>> of as.list(). That way you won't be pulling the whole mapping out of
>>>> the database into a list just to get one thing back out.
>>>>
>>>> mget("GO:0000166",GOTERM,ifnotfound=NA)
>>>>
>>>> Also, with mget() you can pass in multiple accessions into the 1st
>>>> argument and it will just hand you a longer list back.
>>>>
>>>> mget(c("GO:0000066","GO:0000166"),GOTERM,ifnotfound=NA)
>>>>
>>> just being curious, i have checked the performance of all three
>>> solutions posted on this list:
>>>
>>> library(GO.db)
>>> library(rbenchmark)
>>>
>>> ids = sapply(sample(GOTERM, 100), GOID)
>>> print(
>>> benchmark(replications=100, columns=c('test', 'elapsed'),
>>> eapply=eapply(GOTERM[ids], Term),
>>> lapply=lapply(as.list(GOTERM[ids]), Term),
>>> mget=lapply(mget(ids, GOTERM), Term)))
>>>
>>> # test elapsed
>>> # 3 eapply 10.925
>>> # 1 lapply 11.091
>>> # 2 mget 11.160
>>>
>>> it appears that they are (with the particular data sample used)
>>> virtually equivalent in efficiency.
>>
>> I'm not the definitive source for this, but I guess the performance is
>> dominated by, on the one hand, creating S4 instances for each table
>> entry (e.g., in as.list), and on the other immediately extracting a slot
>> from the created S4 object.
>>
>> With this in mind, I thought one could do
>>
>> tbl <- toTable(GOTERM[ids])
>> res1 <- with(tbl, Term[!duplicated(go_id)])
>> identical(sort(unlist(res0)), sort(res1))
>>
>> This is about 10x faster, but now I'm starting to appreciate some of the
>> work the software is doing -- there are duplicate go_ids returned by
>> toTable, corresponding to synonyms for the terms I've entered.
>>
>> Inspired by this success, I looked at the underlying SQL schema (with
>> GO_dbschema()) and intercepted at few calls to the db (with
>> debug(dbGetQuery)) to arrive at this
>>
>> sql <- sprintf("SELECT DISTINCT term
>> FROM go_term
>> LEFT JOIN go_synonym
>> ON go_term._id=go_synonym._id
>> WHERE go_term.go_id IN ('%s');",
>> paste(ids, collapse="','"))
>> res2 <- dbGetQuery(GO_dbconn(), sql)[[1]]
>
> WARNING: There is no guarantee that the data.frame returned by
> dbGetQuery()
> will have its rows sorted in the same order than the ids injected in the
> IN
> clause of the SQL statement. So, here 'res2' is of length 100 but the
> mapping between 'ids' and 'res2' is lost. Note that this is not a
> dbGetQuery()
> feature but an SQL feature (the same would apply from the sqlite3 command
> line client).
>
> Also note that eapply() has the same kind of problem: one needs to be
> careful with the result of eapply(GOTERM[ids], Term) because, strictly
> speaking, there is no reason why the returned list should have its
> elements in the same order as the keys in 'ids'. This behaviour follows
> in fact what the original eapply() does on a *real* environment
> where, according to the man page, "the output is not sorted" even though
> it seems that it is sorted in the lexicographic order of the symbols
> defined in the environment. But since this is not documented, no code
> should rely on this. Anyway, IMO eapply() should not be used on Bimap
> objects because of this confusing behaviour.
>
>> identical(sort(res1), sort(res2))
>>
>> another 2x gain in speed, but also really paying a significant price in
>> terms of responsibility for what the code is doing.
>
> Even faster (DISTINCT and JOIN not needed):
>
> GOid2Term <- function(ids)
> {
> sql <- sprintf("SELECT go_id, term
> FROM go_term
> WHERE go_id IN ('%s')",
> paste(ids, collapse="','"))
> res <- dbGetQuery(GO_dbconn(), sql)
> ans <- res[[2]]
> names(ans) <- res[[1]]
> unname(ans[ids])
> }
>
> Cheers,
> H.
>
>
>>
>> Martin
>>
>>> vQ
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M2-B876
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org
> Phone: (206) 667-5791
> Fax: (206) 667-1319
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
More information about the Bioconductor
mailing list