[BioC] transcriptsBy via TxDb.Hsapiens.UCSC.hg19.knownGene painfully slow
Hervé Pagès
hpages at fhcrc.org
Wed Jan 2 07:21:37 CET 2013
Hi Murat,
On 01/01/2013 02:23 PM, Murat Tasan wrote:
> forgot to reply to the list...
>
> here's the full output (not including the result of the last timing line,
> since that's the offender):
>
> ########################################
>
>> library(TxDb.Hsapiens.UCSC.hg19.knownGene)
> Loading required package: GenomicFeatures
> Loading required package: BiocGenerics
>
> Attaching package: ‘BiocGenerics’
>
> The following object(s) are masked from ‘package:stats’:
>
> xtabs
>
> The following object(s) are masked from ‘package:base’:
>
> anyDuplicated, cbind, colnames, duplicated, eval, Filter, Find, get,
> intersect, lapply, Map,
> mapply, mget, order, paste, pmax, pmax.int, pmin, pmin.int, Position,
> rbind, Reduce, rep.int,
> rownames, sapply, setdiff, table, tapply, union, unique
>
> Loading required package: IRanges
> Loading required package: GenomicRanges
> Loading required package: AnnotationDbi
> Loading required package: Biobase
> Welcome to Bioconductor
>
> Vignettes contain introductory material; view with 'browseVignettes()'.
> To cite Bioconductor,
> see 'citation("Biobase")', and for packages 'citation("pkgname")'.
>
>> TXDB <- TxDb.Hsapiens.UCSC.hg19.knownGene
>
>
>> sessionInfo()
>
>
> R version 2.15.0 (2012-03-30)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
> LC_TIME=en_US.UTF-8
> [4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8
> LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=C LC_NAME=C LC_ADDRESS=C
> [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8
> LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] TxDb.Hsapiens.UCSC.hg19.knownGene_2.8.0 GenomicFeatures_1.10.1
> [3] AnnotationDbi_1.20.3 Biobase_2.18.0
> [5] GenomicRanges_1.10.5 IRanges_1.16.4
> [7] BiocGenerics_0.4.0
>
> loaded via a namespace (and not attached):
> [1] biomaRt_2.14.0 Biostrings_2.26.2 bitops_1.0-5
> BSgenome_1.26.1 DBI_0.2-5
> [6] parallel_2.15.0 RCurl_1.95-3 Rsamtools_1.10.2
> RSQLite_0.11.2 rtracklayer_1.18.1
> [11] stats4_2.15.0 tools_2.15.0 XML_3.95-0.1 zlibbioc_1.4.0
>
> ########################################
>
>
> our sessions look pretty much identical, with the exception of R 2.15.0 for
> me and 2.15.2 for you.
> i'll try to push an upgrade in the next day or so and see if that might
> make a difference.
>
> it also occurred to me that the SQLite query can't be the offender, since
> this line runs perfectly swiftly:
>> foo <- select(TXDB, keys = keys(TXDB), cols = c("TXCHROM", "TXSTRAND",
> "TXSTART", "TXEND"), keytype = "GENEID")
A fair comparison would also select the TXID and TXNAME cols,
because transcriptsBy() extracts them:

  foo <- select(TXDB, keys = keys(TXDB),
                cols = c("TXCHROM", "TXSTRAND", "TXSTART", "TXEND",
                         "TXID", "TXNAME"),
                keytype = "GENEID")

I'm not sure why, but requesting those 2 additional cols slows
select() down by a factor of 10 for me (from 1 sec to 10 sec).
One way to make sure the SQLite query isn't the offender is to use
the SQLite client (sqlite3) from the Unix command line to query the
TxDb.Hsapiens.UCSC.hg19.knownGene.sqlite file directly. Try the
following query, which is more or less the one used by
transcriptsBy(TXDB, by="gene"):

  SELECT transcript._tx_id AS tx_id, tx_name, tx_chrom, tx_strand,
         tx_start, tx_end, gene_id
  FROM transcript INNER JOIN gene ON (transcript._tx_id=gene._tx_id)
  WHERE gene_id IS NOT NULL
  ORDER BY gene_id, tx_chrom, tx_strand, tx_start, tx_end;
On my laptop:

  time sqlite3 TxDb.Hsapiens.UCSC.hg19.knownGene.sqlite 'SELECT
  transcript._tx_id AS tx_id, tx_name, tx_chrom, tx_strand, tx_start,
  tx_end, gene_id FROM transcript INNER JOIN gene ON
  transcript._tx_id=gene._tx_id WHERE gene_id IS NOT NULL ORDER BY
  gene_id, tx_chrom, tx_strand, tx_start, tx_end' > sql.result

  real 0m0.507s
  user 0m0.368s
  sys 0m0.136s
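
To see whether the file has any indexes at all, and whether SQLite
uses them for this join, something along these lines might help. It is
only a sketch: it assumes the sqlite3 client is on your PATH, the
function names list_indexes and join_plan are made up for illustration,
and the default file name has to be replaced with the full path to the
.sqlite file on your system.

```shell
#!/bin/sh
# List every index recorded in an SQLite file as "table.index".
# (list_indexes is a hypothetical helper name, not part of any package.)
list_indexes() {
    sqlite3 "$1" "SELECT tbl_name || '.' || name
                  FROM sqlite_master
                  WHERE type = 'index'
                  ORDER BY 1;"
}

# Ask SQLite how it would execute the gene/transcript join quoted above.
# (join_plan is likewise a hypothetical helper name.)
join_plan() {
    sqlite3 "$1" "EXPLAIN QUERY PLAN
        SELECT transcript._tx_id AS tx_id, tx_name, tx_chrom, tx_strand,
               tx_start, tx_end, gene_id
        FROM transcript INNER JOIN gene
             ON transcript._tx_id = gene._tx_id
        WHERE gene_id IS NOT NULL
        ORDER BY gene_id, tx_chrom, tx_strand, tx_start, tx_end;"
}

# Assumed location: pass the real path as the first argument.
db="${1:-TxDb.Hsapiens.UCSC.hg19.knownGene.sqlite}"
if [ -f "$db" ]; then
    list_indexes "$db"
    join_plan "$db"
else
    echo "SQLite file not found: $db" >&2
fi
```

The plan output says, for each table, whether SQLite scans it or
searches it through an index or primary key. Note that an INTEGER
PRIMARY KEY column does not show up in the index listing, because
SQLite uses the table's rowid for it.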
FWIW, I remember SQLite queries being painfully slow when the sqlite
file is located on OCFS (Oracle Cluster File System), but that was a
long time ago (with an old version of OCFS). Could be worth checking
the file system where your packages are installed (your .libPaths()
folder).
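
A quick way to check is sketched below. It assumes a GNU/Linux df (the
-T option is not available everywhere), fs_type is a name made up for
this example, and you would pass it the folder that .libPaths() reports
in R, or the path to the .sqlite file itself.

```shell
#!/bin/sh
# Print the type of the file system that holds a given path.
# (fs_type is a hypothetical helper name; GNU/Linux df is assumed.)
fs_type() {
    # -P forces one output line per file system, -T adds the "Type" column;
    # line 2 is the data row and its second field is the type.
    df -PT "$1" | awk 'NR == 2 { print $2 }'
}

# Defaults to the current directory when no path is given.
target="${1:-.}"
echo "$target is on: $(fs_type "$target")"
```

If this reports a network or cluster file system, copying the .sqlite
file to a local disk and rerunning the timed query against the copy
would show whether the file system is the bottleneck.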
Cheers,
H.
>
> so i'm guessing something in the GRangesList construction might be going
> haywire?
>
> cheers,
>
> -m
>
>
> On Tue, Jan 1, 2013 at 5:11 PM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
>
>> On 01/01/2013 02:05 PM, Martin Morgan wrote:
>>
>>> On 01/01/2013 01:32 PM, Murat Tasan wrote:
>>>
>>>> hi all - does anyone have any performance tips for using
>>>> transcriptsBy(TXDB, by = "gene") with the UCSC transcript database?
>>>> in particular, is the SQLite backing database file indexed (along columns
>>>> holding the internal IDs)?
>>>> i'd provide some timing results for the command execution, but i ran out
>>>> of
>>>> patience after about 10 minutes with no results...
>>>>
>>>
>>> it is 'slow' but only in the couple of seconds definition of slow.
>>> Something
>>> else is going on so a reproducible example, including sessionInfo(),
>>> would be
>>> helpful.
>>>
>>
>> Just to follow my own advice...
>>
>>
>> library(TxDb.Hsapiens.UCSC.hg19.knownGene)
>> system.time(res <- transcriptsBy(TxDb.Hsapiens.UCSC.hg19.knownGene,
>> by="gene"))
>> length(res)
>> sessionInfo()
>>
>> gives me
>>
>>> library(TxDb.Hsapiens.UCSC.hg19.knownGene)
>>> system.time(res <- transcriptsBy(TxDb.Hsapiens.UCSC.hg19.knownGene,
>> by="gene"))
>> user system elapsed
>> 3.020 0.012 3.042
>>> length(res)
>> [1] 22932
>>> sessionInfo()
>> R version 2.15.2 Patched (2012-12-23 r61401)
>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>
>> locale:
>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
>> [7] LC_PAPER=C LC_NAME=C
>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats graphics grDevices utils datasets methods base
>>
>> other attached packages:
>> [1] TxDb.Hsapiens.UCSC.hg19.knownGene_2.8.0
>> [2] GenomicFeatures_1.10.1
>> [3] AnnotationDbi_1.20.3
>> [4] Biobase_2.18.0
>> [5] GenomicRanges_1.10.5
>> [6] IRanges_1.16.4
>> [7] BiocGenerics_0.4.0
>>
>> loaded via a namespace (and not attached):
>> [1] biomaRt_2.14.0 Biostrings_2.26.2 bitops_1.0-5
>> BSgenome_1.26.1
>> [5] DBI_0.2-5 parallel_2.15.2 RCurl_1.95-3
>> Rsamtools_1.10.2
>> [9] RSQLite_0.11.2 rtracklayer_1.18.1 stats4_2.15.2 tools_2.15.2
>> [13] XML_3.95-0.1 zlibbioc_1.4.0
>>
>>
>>
>>>
>>>
>>>> cheers,
>>>>
>>>> -m
>>>>
>>>> [[alternative HTML version deleted]]
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>>>>
>>>
>>>
>>
>> --
>> Computational Biology / Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N.
>> PO Box 19024 Seattle, WA 98109
>>
>> Location: Arnold Building M1 B861
>> Phone: (206) 667-2793
>>
>
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319