[BioC] GO.db: how to get GO Term
Martin Morgan
mtmorgan at fhcrc.org
Tue Jun 23 15:25:46 CEST 2009
Wacek Kusnierczyk wrote:
> Marc Carlson wrote:
>> One thing you can do to make this more efficient is to use mget instead
>> of as.list(). That way you won't be pulling the whole mapping out of
>> the database into a list just to get one thing back out.
>>
>> mget("GO:0000166",GOTERM,ifnotfound=NA)
>>
>> Also, with mget() you can pass in multiple accessions into the 1st
>> argument and it will just hand you a longer list back.
>>
>> mget(c("GO:0000066","GO:0000166"),GOTERM,ifnotfound=NA)
>>
>
> just being curious, i have checked the performance of all three
> solutions posted on this list:
>
> library(GO.db)
> library(rbenchmark)
>
> ids = sapply(sample(GOTERM, 100), GOID)
> print(
>   benchmark(replications=100, columns=c('test', 'elapsed'),
>     eapply=eapply(GOTERM[ids], Term),
>     lapply=lapply(as.list(GOTERM[ids]), Term),
>     mget=lapply(mget(ids, GOTERM), Term)))
>
> #     test elapsed
> # 3 eapply  10.925
> # 1 lapply  11.091
> # 2   mget  11.160
>
> it appears that they are (with the particular data sample used)
> virtually equivalent in efficiency.
I'm not the definitive source for this, but I guess the performance is
dominated by, on the one hand, creating S4 instances for each table
entry (e.g., in as.list), and on the other immediately extracting a slot
from the created S4 object.
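
To illustrate the guess about S4 overhead, here is a rough sketch with a toy class. ToyTerm is a made-up class for this example, not the class GO.db actually uses, and the timings will vary by machine, so treat it only as a way to see the shape of the cost.

```r
library(methods)

# ToyTerm is a hypothetical stand-in, not GO.db's actual class.
setClass("ToyTerm", representation(Term = "character"))

terms <- sprintf("term %d", 1:10000)

# Route 1: wrap every entry in an S4 object, then pull the slot back out.
t_s4 <- system.time({
  objs <- lapply(terms, function(x) new("ToyTerm", Term = x))
  via_s4 <- vapply(objs, function(o) o@Term, character(1))
})

# Route 2: keep the plain character vector and skip object creation.
t_plain <- system.time(via_plain <- terms)

# Both routes give the same values; only the cost differs.
stopifnot(identical(via_s4, via_plain))
print(c(s4 = t_s4[["elapsed"]], plain = t_plain[["elapsed"]]))
```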
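
With this in mind, I thought one could do

    # res0: presumably the terms from one of the approaches benchmarked
    # above; its definition is not shown in this message
    tbl <- toTable(GOTERM[ids])
    res1 <- with(tbl, Term[!duplicated(go_id)])
    identical(sort(unlist(res0)), sort(res1))
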
This is about 10x faster, but now I'm starting to appreciate some of the
work the software is doing -- there are duplicate go_ids returned by
toTable, corresponding to synonyms for the terms I've entered.
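
The duplicate rows can be pictured with a small toy table. The data frame below is a hand-made stand-in for the toTable() result, with invented ids and terms; the Synonym column name is an assumption about the real table's layout.

```r
# Toy stand-in for toTable(GOTERM[ids]); all values are made up.
tbl <- data.frame(
  go_id   = c("GO:A", "GO:A", "GO:B"),
  Term    = c("term a", "term a", "term b"),
  Synonym = c("synonym 1", "synonym 2", NA),
  stringsAsFactors = FALSE
)

# One row per (term, synonym) pair, so a go_id repeats once per synonym.
print(table(tbl$go_id))

# Keeping the first row for each go_id yields one Term per id.
res1 <- with(tbl, Term[!duplicated(go_id)])
print(res1)
```

Here the two rows for "GO:A" collapse to a single "term a", which is what the !duplicated(go_id) filter is doing above.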
Inspired by this success, I looked at the underlying SQL schema (with
GO_dbschema()) and intercepted a few calls to the db (with
debug(dbGetQuery)) to arrive at this

    sql <- sprintf("SELECT DISTINCT term
                    FROM go_term
                    LEFT JOIN go_synonym
                    ON go_term._id=go_synonym._id
                    WHERE go_term.go_id IN ('%s');",
                   paste(ids, collapse="','"))
    res2 <- dbGetQuery(GO_dbconn(), sql)[[1]]
    identical(sort(res1), sort(res2))

Another 2x gain in speed, but at a significant price in terms of
responsibility for what the code is doing.
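
As a small illustration of that responsibility, the sketch below shows what the sprintf()/paste() step produces, using only base R and no database connection. The query is simplified (the join is left out), and the format check on the ids is an addition for this example, not part of the code above.

```r
ids <- c("GO:0000066", "GO:0000166")

# Guard added for illustration: only interpolate well-formed GO ids,
# since the ids are pasted straight into the SQL text.
stopifnot(all(grepl("^GO:[0-9]{7}$", ids)))

# Collapsing with "','" and wrapping in ('%s') quotes every id.
in_list <- paste(ids, collapse = "','")
sql <- sprintf("SELECT DISTINCT term FROM go_term WHERE go_id IN ('%s');",
               in_list)
cat(sql, "\n")
```

Printing the string before sending it to the database is a cheap way to confirm that the quoting came out as intended.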
Martin
> vQ