[BioC] transcriptsBy via TxDb.Hsapiens.UCSC.hg19.knownGene painfully slow

Wed Jan 2 07:21:37 CET 2013

Hi Murat,

On 01/01/2013 02:23 PM, Murat Tasan wrote:
> forgot to reply to the list...
>
> here's the full output (not including the result of the last timing line,
> since that's the offender):
>
> ########################################
>
>> library(TxDb.Hsapiens.UCSC.hg19.knownGene)
> Loading required package: GenomicFeatures
> Loading required package: BiocGenerics
>
> Attaching package: ‘BiocGenerics’
>
> The following object(s) are masked from ‘package:stats’:
>
>      xtabs
>
> The following object(s) are masked from ‘package:base’:
>
>      anyDuplicated, cbind, colnames, duplicated, eval, Filter, Find, get,
>      intersect, lapply, Map, mapply, mget, order, paste, pmax, pmax.int,
>      pmin, pmin.int, Position, rbind, Reduce, rep.int, rownames, sapply,
>      setdiff, table, tapply, union, unique
>
> Loading required package: IRanges
> Loading required package: GenomicRanges
> Loading required package: AnnotationDbi
> Loading required package: Biobase
> Welcome to Bioconductor
>
>      Vignettes contain introductory material; view with
>      'browseVignettes()'. To cite Bioconductor, see
>      'citation("Biobase")', and for packages 'citation("pkgname")'.
>
>> TXDB <- TxDb.Hsapiens.UCSC.hg19.knownGene
>
>
>> sessionInfo()
>
>
> R version 2.15.0 (2012-03-30)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
>   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8
>   [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>   [7] LC_PAPER=C                 LC_NAME=C                  LC_ADDRESS=C
> [10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] TxDb.Hsapiens.UCSC.hg19.knownGene_2.8.0 GenomicFeatures_1.10.1
> [3] AnnotationDbi_1.20.3                    Biobase_2.18.0
> [5] GenomicRanges_1.10.5                    IRanges_1.16.4
> [7] BiocGenerics_0.4.0
>
> loaded via a namespace (and not attached):
>   [1] biomaRt_2.14.0     Biostrings_2.26.2  bitops_1.0-5       BSgenome_1.26.1    DBI_0.2-5
>   [6] parallel_2.15.0    RCurl_1.95-3       Rsamtools_1.10.2   RSQLite_0.11.2     rtracklayer_1.18.1
> [11] stats4_2.15.0      tools_2.15.0       XML_3.95-0.1       zlibbioc_1.4.0
>
> ########################################
>
>
> our sessions look pretty much identical, with the exception of R 2.15.0 for
> me and 2.15.2 for you.
> i'll try to push an upgrade in the next day or so and see if that might
> make a difference.
>
> it also occurred to me that the SQLite query can't be the offender, since
> this line runs perfectly swiftly:
>> foo <- select(TXDB, keys = keys(TXDB), cols = c("TXCHROM", "TXSTRAND",
> "TXSTART", "TXEND"), keytype = "GENEID")

A fair comparison would be to also select the TXID and TXNAME cols
because transcriptsBy() extracts them:

   foo <- select(TXDB, keys = keys(TXDB),
                 cols = c("TXCHROM", "TXSTRAND", "TXSTART", "TXEND",
                          "TXID", "TXNAME"),
                 keytype = "GENEID")

I'm not sure why, but requesting those 2 additional cols slows down
select() by a factor of 10 for me (from 1 sec to 10 sec).

One way to make sure the SQLite query isn't the offender is to use
the SQLite client (sqlite3) from the Unix command line to query the
TxDb.Hsapiens.UCSC.hg19.knownGene.sqlite file directly. Try the
following query, which is more or less the query used by
transcriptsBy(TXDB, by="gene"):

   SELECT transcript._tx_id AS tx_id, tx_name, tx_chrom, tx_strand,
          tx_start, tx_end, gene_id
   FROM transcript INNER JOIN gene ON (transcript._tx_id=gene._tx_id)
   WHERE gene_id IS NOT NULL
   ORDER BY gene_id, tx_chrom, tx_strand, tx_start, tx_end;

On my laptop:

   time sqlite3 TxDb.Hsapiens.UCSC.hg19.knownGene.sqlite \
     'SELECT transcript._tx_id AS tx_id, tx_name, tx_chrom, tx_strand,
             tx_start, tx_end, gene_id
      FROM transcript INNER JOIN gene ON transcript._tx_id=gene._tx_id
      WHERE gene_id IS NOT NULL
      ORDER BY gene_id, tx_chrom, tx_strand, tx_start, tx_end' > sql.result

   real	0m0.507s
   user	0m0.368s
   sys	0m0.136s
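
If the query turns out to be slow from the command line too, the next
thing to look at would be the indexes (your original question) and the
plan SQLite picks for the join. Here is a minimal sketch using Python's
standard sqlite3 module, chosen only because it talks to the file from
outside R. The table and column names come from the query above;
demo_db() is a hypothetical toy stand-in so the snippet runs anywhere,
and the schema of the real file may well differ.

```python
import sqlite3

# Same query as above, without the trailing semicolon.
QUERY = """
SELECT transcript._tx_id AS tx_id, tx_name, tx_chrom, tx_strand,
       tx_start, tx_end, gene_id
FROM transcript INNER JOIN gene ON (transcript._tx_id = gene._tx_id)
WHERE gene_id IS NOT NULL
ORDER BY gene_id, tx_chrom, tx_strand, tx_start, tx_end
"""

def list_indexes(conn):
    """Return (index name, table name) pairs recorded in sqlite_master."""
    return conn.execute(
        "SELECT name, tbl_name FROM sqlite_master "
        "WHERE type = 'index' ORDER BY name"
    ).fetchall()

def query_plan(conn, sql):
    """Return the human-readable 'detail' column of EXPLAIN QUERY PLAN."""
    # The detail text is the last column in old and new SQLite versions.
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

def demo_db():
    """Hypothetical toy database with only the columns the query needs."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE transcript (_tx_id INTEGER PRIMARY KEY, tx_name TEXT,
            tx_chrom TEXT, tx_strand TEXT, tx_start INTEGER, tx_end INTEGER);
        CREATE TABLE gene (gene_id TEXT NOT NULL, _tx_id INTEGER NOT NULL);
        CREATE INDEX gene_tx_id ON gene (_tx_id);
        INSERT INTO transcript VALUES (1, 'tx1', 'chr1', '+', 100, 200);
        INSERT INTO transcript VALUES (2, 'tx2', 'chr1', '-', 300, 400);
        INSERT INTO gene VALUES ('g1', 1);
        INSERT INTO gene VALUES ('g1', 2);
    """)
    return conn

# To inspect the real file, replace demo_db() with
# sqlite3.connect("TxDb.Hsapiens.UCSC.hg19.knownGene.sqlite").
conn = demo_db()
print(list_indexes(conn))
for step in query_plan(conn, QUERY):
    print(step)
```

An empty index list, or a plan whose steps only say SCAN for both
tables, would point at the database file rather than at R.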

FWIW, I remember SQLite queries being painfully slow when the sqlite
file is located on OCFS (Oracle Cluster File System), but that was a
long time ago (with an old version of OCFS). Could be worth checking
the file system where your packages are installed (your .libPaths()
folder).
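
A quick way to test the file system theory without touching R would be
to copy the sqlite file to a directory on local disk and time the same
query against both copies. A rough sketch in Python (standard library
only). The function names are hypothetical, and it assumes the default
temp directory lives on a local disk, which may not be true on every
cluster (pass local_dir explicitly if it isn't).

```python
import os
import shutil
import sqlite3
import tempfile
import time

def time_query(db_path, sql):
    """Run sql against db_path, fetch every row, return (seconds, row count)."""
    conn = sqlite3.connect(db_path)
    try:
        start = time.perf_counter()
        rows = conn.execute(sql).fetchall()
        elapsed = time.perf_counter() - start
    finally:
        conn.close()
    return elapsed, len(rows)

def compare_with_local_copy(db_path, sql, local_dir=None):
    """Time sql on db_path and on a temporary copy placed under local_dir.

    local_dir=None uses the default temp directory (assumed to be local).
    The copy is deleted before returning.
    """
    tmp_dir = tempfile.mkdtemp(dir=local_dir)
    local_copy = os.path.join(tmp_dir, os.path.basename(db_path))
    try:
        shutil.copy(db_path, local_copy)
        return {
            "original": time_query(db_path, sql),
            "local copy": time_query(local_copy, sql),
        }
    finally:
        shutil.rmtree(tmp_dir)
```

Call it as compare_with_local_copy(path_to_sqlite_file, query) with the
query above. The operating system's page cache can flatter whichever
copy was read most recently, so it is worth running it more than once
before drawing conclusions.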

Cheers,
H.

>
> so i'm guessing something in the GRangesList construction might be going
> haywire?
>
> cheers,
>
> -m
>
>
> On Tue, Jan 1, 2013 at 5:11 PM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
>
>> On 01/01/2013 02:05 PM, Martin Morgan wrote:
>>
>>> On 01/01/2013 01:32 PM, Murat Tasan wrote:
>>>
>>>> hi all - does anyone have any performance tips for using
>>>> transcriptsBy(TXDB, by = "gene") with the UCSC transcript database?
>>>> in particular, is the SQLite backing database file indexed (along columns
>>>> holding the internal IDs)?
>>>> i'd provide some timing results for the command execution, but i ran out of
>>>> patience after about 10 minutes with no results...
>>>>
>>>
>>> it is 'slow' but only in the couple of seconds definition of slow. Something
>>> else is going on so a reproducible example, including sessionInfo(), would be
>>> helpful.
>>>
>>
>> Just to follow my own advice...
>>
>>
>>    library(TxDb.Hsapiens.UCSC.hg19.knownGene)
>>    system.time(res <- transcriptsBy(TxDb.Hsapiens.UCSC.hg19.knownGene,
>> by="gene"))
>>    length(res)
>>    sessionInfo()
>>
>> gives me
>>
>>>    library(TxDb.Hsapiens.UCSC.hg19.knownGene)
>>>    system.time(res <- transcriptsBy(TxDb.Hsapiens.UCSC.hg19.knownGene,
>> by="gene"))
>>     user  system elapsed
>>    3.020   0.012   3.042
>>>    length(res)
>> [1] 22932
>>>    sessionInfo()
>> R version 2.15.2 Patched (2012-12-23 r61401)
>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>
>> locale:
>>   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>   [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>   [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>   [7] LC_PAPER=C                 LC_NAME=C
>>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>
>> other attached packages:
>> [1] TxDb.Hsapiens.UCSC.hg19.knownGene_2.8.0
>> [2] GenomicFeatures_1.10.1
>> [3] AnnotationDbi_1.20.3
>> [4] Biobase_2.18.0
>> [5] GenomicRanges_1.10.5
>> [6] IRanges_1.16.4
>> [7] BiocGenerics_0.4.0
>>
>> loaded via a namespace (and not attached):
>>   [1] biomaRt_2.14.0     Biostrings_2.26.2  bitops_1.0-5       BSgenome_1.26.1
>>   [5] DBI_0.2-5          parallel_2.15.2    RCurl_1.95-3       Rsamtools_1.10.2
>>   [9] RSQLite_0.11.2     rtracklayer_1.18.1 stats4_2.15.2      tools_2.15.2
>> [13] XML_3.95-0.1       zlibbioc_1.4.0
>>
>>
>>
>>>
>>>
>>>> cheers,
>>>>
>>>> -m
>>>>
>>>>      [[alternative HTML version deleted]]
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>>>>
>>>
>>>
>>
>> --
>> Computational Biology / Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N.
>> PO Box 19024 Seattle, WA 98109
>>
>> Location: Arnold Building M1 B861
>> Phone: (206) 667-2793
>>
>
> 	[[alternative HTML version deleted]]
>
>
>
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319