[BioC] Extracting strand information from GenomicFeatures transcript db objects

Hervé Pagès hpages at fhcrc.org
Fri Mar 14 18:49:44 CET 2014


On 03/14/2014 10:46 AM, Hervé Pagès wrote:
> Hi Veerendra,
>
> Just to let you know that I've fixed the warning you get with
> makeTranscriptDbFromGFF() that "the chromosome lengths and
> circularity flags are not available for this TranscriptDb object".

I mean, I fixed the warning you get with makeTranscriptDbFromBiomart(),
not with makeTranscriptDbFromGFF() (which I didn't touch).

H.

> The fix is in GenomicFeatures 1.14.5 (release) and 1.15.11 (devel),
> both of which will become available via biocLite() in the next hour
> or so.
>
> Cheers,
> H.
>
>
> On 03/07/2014 03:24 PM, Hervé Pagès wrote:
>> Hi Veerendra,
>>
>> Don't know about the error you get with makeTranscriptDbFromGFF().
>>
>> However, please note that you should also be able to create that
>> TranscriptDb object thru the Ensembl Mart:
>>
>>    library(GenomicFeatures)
>>    txdb <- makeTranscriptDbFromBiomart("ensembl",
>> "mmusculus_gene_ensembl")
>>
>> Then:
>>
>>    > txdb
>>    TranscriptDb object:
>>    | Db type: TranscriptDb
>>    | Supporting package: GenomicFeatures
>>    | Data source: BioMart
>>    | Organism: Mus musculus
>>    | Resource URL: www.biomart.org:80
>>    | BioMart database: ensembl
>>    | BioMart database version: ENSEMBL GENES 75 (SANGER UK)
>>    | BioMart dataset: mmusculus_gene_ensembl
>>    | BioMart dataset description: Mus musculus genes (GRCm38.p2)
>>    | BioMart dataset version: GRCm38.p2
>>    | Full dataset: yes
>>    | miRBase build ID: NA
>>    | transcript_nrow: 94929
>>    | exon_nrow: 404385
>>    | cds_nrow: 313584
>>    | Db created by: GenomicFeatures package from Bioconductor
>>    | Creation time: 2014-03-07 15:20:13 -0800 (Fri, 07 Mar 2014)
>>    | GenomicFeatures version at creation time: 1.14.3
>>    | RSQLite version at creation time: 0.11.4
>>    | DBSCHEMAVERSION: 1.0
>>
>> For whatever reason, I got a warning that the chromosome lengths and
>> circularity flags are not available for this TranscriptDb object (I'll
>> look into this). Not a big deal though, the TranscriptDb object should
>> still be fully functional.
>>
>> Cheers,
>> H.
>>
>>
>> On 03/07/2014 01:59 PM, Veerendra Gadekar wrote:
>>> Dear Members,
>>>
>>> Greetings!
>>>
>>> I am trying to create a Transcript db object, from the GTF file for
>>> Musmusculus transcriptome data available at ensembl FTP site
>>>
>>> (version 75). ftp://ftp.ensembl.org/pub/release-75/gtf/mus_musculus
>>>
>>> I am using the function, "makeTranscriptDbFromGFF()" from the "
>>> GenomicFeatures" bioconductor library.
>>>
>>> *Command used:*
>>>
>>>> library("GenomicFeatures")
>>>> txdb_75 = makeTranscriptDbFromGFF (file =
>>>> '/Mus_musculus.GRCm38.75.gtf',
>>> exonRankAttributeName='exon_number', format='gtf', dataSource='ENSEMBL',
>>> species='Mus musculus')
>>>
>>> *Error message:*
>>>
>>> extracting transcript information
>>> Estimating transcript ranges.
>>> Error in unique(sub$transcript_id) :
>>> error in evaluating the argument 'x' in selecting a method for function
>>> 'unique': Error in subs[[i]] : subscript out of bounds
>>>
>>> I am not able to understand the above error message. However, I could
>>> create the transcript db object successfully for the GTF file
>>>
>>> downloaded from ensembl version 74.
>>> ftp://ftp.ensembl.org/pub/release-74/gtf/mus_musculus/
>>>
>>> *Session information:*
>>>
>>>> sessionInfo()
>>> R version 3.0.2 (2013-09-25)
>>> Platform: x86_64-pc-linux-gnu (64-bit)
>>>
>>> locale:
>>>   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>>   [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>>   [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>>   [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>
>>> attached base packages:
>>> [1] parallel  stats     graphics  grDevices utils     datasets  methods
>>> [8] base
>>>
>>> other attached packages:
>>> [1] data.table_1.9.2       biomaRt_2.16.0         GenomicFeatures_1.12.4
>>> [4] AnnotationDbi_1.22.6   Biobase_2.20.1         GenomicRanges_1.12.5
>>> [7] IRanges_1.18.4         BiocGenerics_0.6.0     BiocInstaller_1.10.4
>>>
>>> loaded via a namespace (and not attached):
>>>   [1] Biostrings_2.28.0  bitops_1.0-6       BSgenome_1.28.0    DBI_0.2-7
>>>
>>>   [5] plyr_1.8.1         Rcpp_0.11.0        RCurl_1.95-4.1
>>> reshape2_1.2.2
>>>   [9] Rsamtools_1.12.4   RSQLite_0.11.4     rtracklayer_1.20.4
>>> stats4_3.0.2
>>>
>>> [13] stringr_0.6.2      tools_3.0.2        XML_3.98-1.1
>>> zlibbioc_1.6.0
>>>
>>> Your assistance in this regard would be very much appreciated.
>>>
>>> Thank you in advance.
>>>
>>> Best Regards,
>>>
>>> Veerendra
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Wed, Feb 13, 2013 at 1:32 PM, Veerendra GP
>>> <gpveerendra09 at gmail.com>wrote:
>>>
>>>> Dear Valerie,
>>>>
>>>> Thank you very much for the descriptive reply and your suggestion, it
>>>> indeed was a cryptic error for me but now I understand the cause.
>>>>
>>>> Thanks,
>>>> Veerendra
>>>>
>>>> On Wed, Feb 13, 2013 at 12:19 AM, Valerie Obenchain
>>>> <vobencha at fhcrc.org>wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> Thanks for sending a reproducible example. Sorry for the cryptic error
>>>>> message. That is something we could improve.
>>>>>
>>>>> The problem is that for the UCSC data, some transcripts in a single
>>>>> gene
>>>>> are from different strands. In this case the strand Rle will have
>>>>> multiple
>>>>> runValues (i.e. runValue longer than 1).
>>>>>
>>>>> When we summarize strand runValues by elementLength we see the UCSC
>>>>> data
>>>>> has runValues of 1, 2, 3, 4 and 5.
>>>>>
>>>>>   table(elementLengths(runValue(strand(GenegroupedTranscripts))))
>>>>>>
>>>>>
>>>>>      1
>>>>> 37315
>>>>>
>>>>>> table(elementLengths(runValue(strand(UCSCGenegroupedTranscripts))))
>>>>>>
>>>>>
>>>>>      1     2     3     4     5
>>>>> 21686    72     1     1     1
>>>>>
>>>>> Here is a closer look at the entry that has an elementLength of 3:
>>>>>
>>>>>   UCSC_elts <- elementLengths(runValue(strand(
>>>>>> UCSCGenegroupedTranscripts)))
>>>>>> UCSCGenegroupedTranscripts[UCSC_elts == 3]
>>>>>>
>>>>>
>>>>> GRangesList of length 1:
>>>>> $108816
>>>>> GRanges with 5 ranges and 2 metadata columns:
>>>>>            seqnames               ranges strand |     tx_id
>>>>> tx_name
>>>>>               <Rle>            <IRanges>  <Rle> | <integer>
>>>>> <character>
>>>>>    [1]         chr4 [42452733, 42476567]      + |     10641
>>>>> uc008smu.1
>>>>>    [2]         chr4 [42471253, 42471291]      + |     10642
>>>>> uc008smv.1
>>>>>    [3]         chr4 [41902019, 41925853]      - |     12216
>>>>> uc008slb.1
>>>>>    [4]         chr4 [41907295, 41907333]      - |     12217
>>>>> uc008slg.1
>>>>>    [5] chrUn_random [  588639,   612480]      + |     55359
>>>>> uc012hdn.1
>>>>>
>>>>>
>>>>> The strand Rle for this gene is,
>>>>>
>>>>>   strand(UCSCGenegroupedTranscripts[UCSC_elts == 3])
>>>>>>
>>>>> CompressedRleList of length 1
>>>>> $`108816`
>>>>> factor-Rle of length 5 with 3 runs
>>>>>    Lengths: 2 2 1
>>>>>    Values : + - +
>>>>> Levels(3): + - *
>>>>>
>>>>> In the case of multiple stands per list element (per gene in this
>>>>> case)
>>>>> you can't use as.integer(). It looks like you are after a vector of a
>>>>> single strand per gene. Depending on your use case you could remove
>>>>> these
>>>>> genes or set the strand of these genes to '*'.
>>>>>
>>>>>
>>>>> Valerie
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 02/12/2013 08:11 AM, Veerendra GP wrote:
>>>>>
>>>>>> Hello Everyone!
>>>>>>
>>>>>> Greetings!
>>>>>>
>>>>>> I am using "GenomicFeatures" library to create Transcript db
>>>>>> objects for
>>>>>> the mouse genome.
>>>>>> I did it by using biomart, *makeTranscriptDbFromBiomart()* and also
>>>>>>
>>>>>> UCSC, makeTranscriptDbFromUCSC()
>>>>>> functions.
>>>>>>
>>>>>> As I am interested in the strand information, I tried fetching the it
>>>>>> using
>>>>>> *as.integer(),* *runValue**()* & *strand()* functions. I am able to
>>>>>> get
>>>>>> it
>>>>>> from the transcriptdb object, created using
>>>>>> *makeTranscriptDbFromBiomart()* but
>>>>>>
>>>>>> ending up with an error when I am trying to do the same with
>>>>>> transcriptdb object created using makeTranscriptDbFromUCSC()
>>>>>> function*.*
>>>>>>
>>>>>>
>>>>>> working code:
>>>>>>
>>>>>> mouseGNM <- makeTranscriptDbFromBiomart (biomart = "ensembl",
>>>>>> dataset=
>>>>>> "mmusculus_gene_ensembl");
>>>>>> saveDb(mouseGNM, file="MouseGNM.sqlite");
>>>>>> library("GenomicFeatures");
>>>>>> MouseGNM<- loadDb("MouseGNM.sqlite");
>>>>>> GenegroupedTranscripts<- transcriptsBy(MouseGNM, by = "gene");
>>>>>> strand.info <-as.integer(runValue(strand(GenegroupedTranscripts)));
>>>>>>
>>>>>>   runValue(strand(GenegroupedTranscripts)) CompressedIntegerList of
>>>>>>> length
>>>>>>>
>>>>>> 37315 [["ENSMUSG00000000001"]] 2 [["ENSMUSG00000000003"]] 2
>>>>>> [["ENSMUSG00000000028"]] 2 [["ENSMUSG00000000031"]] 2
>>>>>> [["ENSMUSG00000000037"]] 1 [["ENSMUSG00000000049"]] 1
>>>>>> [["ENSMUSG00000000056"]] 1 [["ENSMUSG00000000058"]] 1
>>>>>> [["ENSMUSG00000000078"]] 1 [["ENSMUSG00000000085"]] 1 ... <37305 more
>>>>>> elements>
>>>>>>
>>>>>>   strand.info[1:100]
>>>>>>>
>>>>>>    [1] 2 2 2 2 1 1 1 1 1 1 1 1 1 2 2 1 1 1 2 1 1 1 2 1 2 1 1 2 2 1
>>>>>> 1 2 1
>>>>>> 2 2
>>>>>> 1 2 [38] 2 1 1 1 1 1 1 2 1 1 2 2 1 1 1 2 2 1 1 1 2 1 2 1 2 2 1 1 1
>>>>>> 1 1 2
>>>>>> 1
>>>>>> 2 2 2 2 [75] 2 2 2 2 2 2 2 1 2 1 1 2 2 2 1 1 1 2 1 2 2 2 1 1 2 1
>>>>>>
>>>>>> here 2 is the antisense strand and 1 the sense strand
>>>>>>
>>>>>>
>>>>>> UCSC.mouseGNM <- makeTranscriptDbFromUCSC(genome = "mm9", tablename =
>>>>>> "knownGene");
>>>>>> saveDb(UCSC.mouseGNM, file="UCSCMouseGNM.sqlite");
>>>>>> UCSCMouseGNM<- loadDb("UCSCMouseGNM.sqlite");
>>>>>> UCSCGenegroupedTranscripts<- transcriptsBy(UCSCMouseGNM, by =
>>>>>> "gene");
>>>>>> strand.info2<-
>>>>>> as.integer(runValue(strand(UCSCGenegroupedTranscripts)));
>>>>>>
>>>>>>   runValue(strand(UCSCGenegroupedTranscripts))
>>>>>>>
>>>>>> CompressedIntegerList of length 21761
>>>>>> [["100009600"]] 2
>>>>>> [["100009609"]] 2
>>>>>> [["100009614"]] 1
>>>>>> [["100012"]] 2
>>>>>> [["100017"]] 2
>>>>>> [["100019"]] 1
>>>>>> [["100033459"]] 1
>>>>>> [["100034251"]] 1
>>>>>> [["100034361"]] 2
>>>>>> [["100034684"]] 2
>>>>>> ...
>>>>>> <21751 more elements>
>>>>>>
>>>>>> *ERROR:*
>>>>>>
>>>>>>   strand.info2<- as.integer(runValue(strand(
>>>>>>> UCSCGenegroupedTranscripts)));
>>>>>>>
>>>>>> Error in as.vector(x, mode = "integer") : coercing an AtomicList
>>>>>> object
>>>>>> to
>>>>>> an atomic vector is supported only for objects with top-level
>>>>>> elements of
>>>>>> length <= 1
>>>>>>
>>>>>> I am not able to understand this error message, could anyone let me
>>>>>> know
>>>>>> what is going wrong in the code.
>>>>>> Here is the session information.
>>>>>>
>>>>>>   sessionInfo() R version 2.15.2 (2012-10-26) Platform:
>>>>>>> x86_64-pc-linux-gnu
>>>>>>>
>>>>>> (64-bit) locale: [1] LC_CTYPE=en_AU.UTF-8 LC_NUMERIC=C [3]
>>>>>> LC_TIME=en_US.UTF-8 LC_COLLATE=en_AU.UTF-8 [5]
>>>>>> LC_MONETARY=en_US.UTF-8
>>>>>> LC_MESSAGES=en_AU.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C
>>>>>> LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>>>> attached
>>>>>> base packages: [1] stats graphics grDevices utils datasets methods
>>>>>> base
>>>>>> other attached packages: [1] GenomicFeatures_1.10.1
>>>>>> AnnotationDbi_1.20.3
>>>>>> Biobase_2.18.0 [4] GenomicRanges_1.10.5 IRanges_1.16.4
>>>>>> BiocGenerics_0.4.0
>>>>>> loaded via a namespace (and not attached): [1] biomaRt_2.14.0
>>>>>> Biostrings_2.26.2 bitops_1.0-5 BSgenome_1.26.1 [5] DBI_0.2-5
>>>>>> parallel_2.15.2 RCurl_1.95-3 Rsamtools_1.10.2 [9] RSQLite_0.11.2
>>>>>> rtracklayer_1.18.2 stats4_2.15.2 tools_2.15.2 [13] XML_3.95-0.1
>>>>>> zlibbioc_1.4.0
>>>>>>
>>>>>>
>>>>>> Thank you in advance.
>>>>>> Regards,
>>>>>>
>>>>>> Veerendra.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> GP.Veerendra
>>>> PhD Student (Bioinformatics)
>>>> Stazione Zoologica Anton Dohrn
>>>> Naples, Italy
>>>> M:+393663915221
>>>>
>>>
>>>
>>>
>>
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioconductor mailing list