[BioC] GO.db: how to get GO Term

Tue Jun 23 15:25:46 CEST 2009

Wacek Kusnierczyk wrote:
> Marc Carlson wrote:
>> One thing you can do to make this more efficient is to use mget instead
>> of as.list().  That way you won't be pulling the whole mapping out of
>> the database into a list just to get one thing back out.
>>
>> mget("GO:0000166",GOTERM,ifnotfound=NA)
>>
>> Also, with mget() you can pass in multiple accessions into the 1st
>> argument and it will just hand you a longer list back.
>>
>> mget(c("GO:0000066","GO:0000166"),GOTERM,ifnotfound=NA)
>>   
> 
> just being curious, i have checked the performance of all three
> solutions posted on this list:
> 
>    library(GO.db)
>    library(rbenchmark)
> 
>    ids = sapply(sample(GOTERM, 100), GOID)
>    print(
>        benchmark(replications=100, columns=c('test', 'elapsed'),
>            eapply=eapply(GOTERM[ids], Term),
>            lapply=lapply(as.list(GOTERM[ids]), Term),
>            mget=lapply(mget(ids, GOTERM), Term)))
> 
>    #     test elapsed
>    # 3 eapply  10.925
>    # 1 lapply  11.091
>    # 2   mget  11.160
> 
> it appears that they are (with the particular data sample used)
> virtually equivalent in efficiency.

I'm not the definitive source for this, but I guess the performance is
dominated by, on the one hand, creating S4 instances for each table
entry (e.g., in as.list), and on the other immediately extracting a slot
from the created S4 object.

With this in mind, I thought one could do

    tbl <- toTable(GOTERM[ids])
    res1 <- with(tbl, Term[!duplicated(go_id)])
    identical(sort(unlist(res0)), sort(res1))

This is about 10x faster, but now I'm starting to appreciate some of the
work the software is doing -- there are duplicate go_ids returned by
toTable, corresponding to synonyms for the terms I've entered.

Inspired by this success, I looked at the underlying SQL schema (with
GO_dbschema()) and intercepted at few calls to the db (with
debug(dbGetQuery)) to arrive at this

    sql <- sprintf("SELECT DISTINCT term
                    FROM go_term
                       LEFT JOIN go_synonym
                       ON go_term._id=go_synonym._id
                    WHERE go_term.go_id IN ('%s');",
                   paste(ids, collapse="','"))
    res2 <- dbGetQuery(GO_dbconn(), sql)[[1]]
    identical(sort(res1), sort(res2))

another 2x gain in speed, but also really paying a significant price in
terms of responsibility for what the code is doing.

Martin

> vQ