[BioC] retrieving annotation
Kathi Zarnack
zarnack at ebi.ac.uk
Tue Nov 19 16:10:34 CET 2013
Hi Nico,
thanks for the hint, I will have a look at AnnotationHub. I was looking
for the transcript_biotype rather than the gene_biotype (to discriminate
protein_coding isoforms from others such as processed_transcript),
but this should also be included in the Ensembl GTF file.
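As a rough sketch of that filtering step in base R: this assumes, as in the
release 73 output quoted below, that the second GTF column (source) carries the
transcript-level biotype. The three GTF records, their IDs and the variable
names are made up for illustration, and other releases may store the biotype
in an attribute instead.

```r
# Toy GTF records (tab-separated, 9 columns); the IDs are invented for illustration.
gtf_lines <- c(
  "1\tprotein_coding\texon\t1735\t2449\t.\t+\t.\tgene_id \"G1\"; transcript_id \"T1\";",
  "1\tprocessed_transcript\texon\t3000\t3500\t.\t+\t.\tgene_id \"G1\"; transcript_id \"T2\";",
  "2\tprotein_coding\tCDS\t100\t400\t.\t-\t0\tgene_id \"G2\"; transcript_id \"T3\";"
)

# Read the records into a data frame; quote = "" keeps the quoted attribute values intact.
gtf <- read.delim(
  text = gtf_lines, header = FALSE, quote = "", stringsAsFactors = FALSE,
  col.names = c("seqname", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes")
)

# Pull the transcript_id out of the free-text attributes column.
gtf$transcript_id <- sub('.*transcript_id "([^"]+)".*', "\\1", gtf$attributes)

# Assumption: 'source' holds the transcript biotype (release 73 layout).
coding <- gtf[gtf$source == "protein_coding", ]
coding_transcripts <- unique(coding$transcript_id)
print(coding_transcripts)
```

On the GRanges object from AnnotationHub the same idea would presumably be
something like xx[mcols(xx)$source == "protein_coding"], though I have not
checked that against the real object.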
Thanks,
Kathi
On 16/11/13 22:39, Nicolas Delhomme wrote:
> Hej Kathi!
>
> In a different thread (GTF file error when using easyRNAseq), Martin mentioned that you can access Ensembl GTF files through AnnotationHub. I have copied part of his answer below; as you can see, the gene_biotype is part of the annotation:
>
> > library(AnnotationHub)
> > hub = AnnotationHub()
> > hub$ensembl.release.73.<tab>
> hub$ensembl.release.73.fasta. ... [378]
> hub$ensembl.release.73.gtf. ... [63]
> > xx = hub$ensembl.release.73.gtf.gallus_gallus.Gallus_gallus.Galgal4.73.gtf_0.0.1.RData
> > xx
> GRanges with 381368 ranges and 12 metadata columns:
>       seqnames       ranges strand |         source     type
>          <Rle>    <IRanges>  <Rle> |       <factor> <factor>
>   [1]        1 [1735, 2449]      + | protein_coding     exon
>   [2]        1 [2379, 2449]      + | protein_coding      CDS
>           score     phase            gene_id      transcript_id
>       <numeric> <integer>        <character>        <character>
>   [1]      <NA>      <NA> ENSGALG00000009771 ENSGALT00000015891
>   [2]      <NA>         0 ENSGALG00000009771 ENSGALT00000015891
>       exon_number   gene_biotype            exon_id         protein_id
>         <numeric>    <character>        <character>        <character>
>   [1]           1 protein_coding ENSGALE00000301221               <NA>
>   [2]           1 protein_coding               <NA> ENSGALP00000015874
>         gene_name transcript_name
>       <character>     <character>
>   [1]        <NA>            <NA>
>   [2]        <NA>            <NA>
>  [ reached getOption("max.print") -- omitted 9 rows ]
>   ---
>   seqlengths:
>                 1              2 ... AADN03010940.1
>                NA             NA ...             NA
>
> Hope this helps,
>
> Cheers,
>
> Nico
>
> ---------------------------------------------------------------
> Nicolas Delhomme
>
> Genome Biology Computational Support
>
> European Molecular Biology Laboratory
>
> Tel: +49 6221 387 8310
> Email: nicolas.delhomme at embl.de
> Meyerhofstrasse 1 - Postfach 10.2209
> 69102 Heidelberg, Germany
> ---------------------------------------------------------------
>
> On 7 Nov 2013, at 14:11, Kathi Zarnack <zarnack at ebi.ac.uk> wrote:
>
>> Hi,
>>
>> I wanted to ask whether any of the annotation packages contains information on the transcript biotype (protein-coding, etc.). I would like to select only protein-coding isoforms from the Ensembl annotation, but I could not find any package that includes this information. Otherwise I will get it with biomaRt; I just wondered whether it is already included somewhere.
>>
>> Also, I tried to download GENCODE annotation using GenomicFeatures, and got the following error:
>>
>>> test=makeTranscriptDbFromUCSC(genome="hg19", tablename="wgEncodeGencodeManualV3")
>> Error in tableNames(ucscTableQuery(session, track = track)) :
>> error in evaluating the argument 'object' in selecting a method for function 'tableNames': Error in normArgTrack(track, trackids) : Unknown track: Gencode Genes
>>
>> I tried to get the same table for hg18, but I only got one step further:
>>
>> test=makeTranscriptDbFromUCSC(genome="hg18", tablename="wgEncodeGencodeManualV3")
>> Download the wgEncodeGencodeManualV3 table ... OK
>> Download the wgEncodeGencodeClassesV3 table ... Error in normArgTable(value, x) :
>> unknown table name 'wgEncodeGencodeClassesV3'
>>
>> Thank you very much for your help,
>> Kathi
>>
>>
>> ------------------------------------------
>>
>>> sessionInfo()
>> R version 3.0.2 (2013-09-25)
>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>
>> locale:
>> [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
>> [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
>> [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
>> [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C
>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] parallel stats graphics grDevices utils datasets methods
>> [8] base
>>
>> other attached packages:
>> [1] GenomicFeatures_1.14.0 AnnotationDbi_1.24.0 Biobase_2.22.0
>> [4] GenomicRanges_1.14.3 XVector_0.2.0 IRanges_1.20.5
>> [7] BiocGenerics_0.8.0 BiocInstaller_1.12.0
>>
>> loaded via a namespace (and not attached):
>> [1] biomaRt_2.18.0 Biostrings_2.30.0 bitops_1.0-6 BSgenome_1.30.0
>> [5] DBI_0.2-7 RCurl_1.95-4.1 Rsamtools_1.14.1 RSQLite_0.11.4
>> [9] rtracklayer_1.22.0 stats4_3.0.2 tcltk_3.0.2 tools_3.0.2
>> [13] XML_3.98-1.1 zlibbioc_1.8.0
>>
>>
>> --
>> Dr. Kathi Zarnack
>> Luscombe Group
>>
>> European Molecular Biology Laboratory
>> European Bioinformatics Institute (EMBL-EBI)
>> Wellcome Trust Genome Campus
>> Hinxton
>> Cambridge CB10 1SD
>> United Kingdom
>>
>> email zarnack at ebi.ac.uk
>> tel +44 1223 494 526
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor