[BioC] GenomicRanges : How to retrieve exon ID and gene ID from exon coordinates?

Tue Sep 11 22:34:15 CEST 2012

TCGA BAMs aren't usually available to the public because they contain
personally identifiable information about actual humans (not cell lines).

BED and WIG files are not produced, as they are for (say) ENCODE or
the Roadmap project.
Instead, a file format specific to TCGA is used, and it typically
looks like this, e.g. for exons:

(from https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/coad/cgcc/unc.edu/illuminaga_rnaseq/rnaseq/unc.edu_COAD.IlluminaGA_RNASeq.Level_3.3.4.0/UNCID_278446.TCGA-AA-3860-01A-02R-0905-07.100908_UNC6-RDR300211_00030_FC_62EL4AAXX.1.trimmed.annotated.translated_to_genomic.exon.quantification.txt)

exon raw_counts median_length_normalized RPKM
chr1:11874-12227:+ 0 0 0
...
chrM:14762-15888:+	9917181	8799.62821650399	4252.73411539598
chrM:15999-16571:+	287037	500.937172774869	242.095751310737

Since all of the ranges in the first column will be the same for a
given tumor type, those are mapped to the GAF, where their annotations
are kept, e.g. (from
https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/other/GAF/GAF_bundle/outputs/TCGA.Sept2010.09202010.gaf)

#EntryNumber	FeatureID	FeatureType	FeatureDBSource	FeatureDBVersion	FeatureDBDate	FeatureSeqFileName	Composite	CompositeType	CompositeDBSource	CompositeDBVersion	CompositeDBDate	AlignmentType	FeatureCoordinates	CompositeCoordinates	Gene	GeneLocus	FeatureAliases	FeatureInfo
1	CPA1|1357	gene	calculated			genomic	GRCh37	genome	NCBI	GRCh37		pairwise	1-137,138-219,220-453,454-555,556-657,658-1064,1065-1155,1156-1355,1356-1440,1441-1724	chr7:130020290-130020426,130020939-130021020,130021471-130021704,130021949-130022050,130023232-130023333,130023525-130023931,130024377-130024467,130024987-130025186,130025680-130025764,130027665-130027948:+	CPA1|1357	chr7:130020290-130027948:+
2	GUCY2D|3000	gene	calculated			genomic	GRCh37	genome	NCBI	GRCh37		pairwise	1-65,66-795,796-1100,1101-1452,1453-1537,1538-1640,1641-1742,1743-1823,1824-2030,2031-2187,2188-2337,2338-2486,2487-2650,2651-2843,2844-3018,3019-3117,3118-3212,3213-3298,3299-3410,3411-3623	chr17:7905988-7906052,7906357-7907086,7907170-7907474,7909681-7910032,7910378-7910462,7910744-7910846,7911249-7911350,7912824-7912904,7915462-7915668,7915768-7915924,7916421-7916570,7917198-7917346,7917919-7918082,7918177-7918369,7918646-7918820,7919061-7919159,7919245-7919339,7919523-7919608,7919761-7919872,7923446-7923658:+

It is a drag to deal with this so I am going to cc: Marc Carlson on
this with a suggestion that we (possibly me, but better someone at UNC
or UBC) upload FDb.* packages with the GAF ranges and their metadata
contained, as annotations for  the release cycle.  I won't have time
until this weekend to look.

Best,

--t

On Tue, Sep 11, 2012 at 1:24 PM, James W. MacDonald <jmacdon at uw.edu> wrote:
>
> Hi Ying,
>
>
> On 9/11/2012 4:06 PM, ying chen wrote:
>>
>> Hi,
>>
>> I read the vignettes and tried the code Jim suggested, but got an error message I could not figure out why.
>>
>> Here is what I did:
>>
>>     > library(GenomicFeatures)
>>     > library(GenomicRanges)
>>     > library(TxDb.Hsapiens.UCSC.hg19.knownGene)
>>     > TCGA_CRC <-
>>     read.delim("TCGA_CRC_UNC_RNASeq_ExonID_detail_new.txt",stringsAsFactors=FALSE)
>>     > ex <- exons(TxDb.Hsapiens.UCSC.hg19.knownGene, columns =
>>     c("exon_id","gene_id","tx_id"))
>>     > TCGA_CRC <- GRanges(seqnames=TCGA_CRC$Chrom,
>>     ranges=IRanges(start=TCGA_CRC$Start,end=TCGA_CRC$End),strand=TCGA_CRC$Strand,name=TCGA_CRC$ExonID)
>>     > elementMetadata(TCGA_CRC) <- elementMetadata(ex)[match(TCGA_CRC,
>>     ex),]
>>     Error in elementMetadata(ex)[match(TCGA_CRC, ex), ] :
>>       selecting rows: subscript contains NAs or out of bounds indices
>>
>
> You will get that error if there are ranges in TCGA_CRC that are not in ex. So the assumption here is that the exons from your RNA-seq experiment are exons that match up with what is in the knownGene table from UCSC.
>
> The fact that you have counts from exons that don't align to the expected ranges from the version of the knownGene table you are using is problematic, and implies that something is amiss. A first check is to ensure that your RNA-seq data (and counts) refer to the hg19 build.
>
> You can find out which ranges are different between the two using the %in% operator:
>
> x <- which(!TCGA_CRC %in% ex)
>
> Which will at least tell you how far off you are. Hypothetically, if there are only a few mismatching ranges, you could subset the TCGA_CRC GRanges object, but personally I would want to know why things aren't agreeing.
>
> Alternatively you could use the HTSeq python scripts to get the counts, in which case you won't need to bother with all of this, or you could use the summarizeOverlaps() function in GenomicRanges to do the counting, which will again ensure that the counts you have align to the exons from the knownGene table.
>
> Best,
>
> Jim
>
>
>
>
>
>>     >
>>
>> I cannot tell what is subscript it mentioned.
>>
>> Any suggestion?
>>
>> Thanks a lot fo the help!
>>
>> Ying
>>
>> The following are the some info:
>>
>> > sessionInfo()
>> R version 2.15.1 (2012-06-22)
>> Platform: x86_64-pc-mingw32/x64 (64-bit)
>>
>> locale:
>> [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
>> [4] LC_NUMERIC=C                           LC_TIME=English_United States.1252
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>
>> other attached packages:
>> [1] TxDb.Hsapiens.UCSC.hg19.knownGene_2.7.1 GenomicFeatures_1.8.3                   AnnotationDbi_1.18.3
>> [4] Biobase_2.16.0                          GenomicRanges_1.8.13                    IRanges_1.14.4
>> [7] BiocGenerics_0.2.0
>>
>> loaded via a namespace (and not attached):
>>  [1] biomaRt_2.12.0     Biostrings_2.24.1  bitops_1.0-4.1     BSgenome_1.24.0    DBI_0.2-5          RCurl_1.91-1.1     Rsamtools_1.8.6
>>  [8] RSQLite_0.11.1     rtracklayer_1.16.3 stats4_2.15.1      tools_2.15.1       XML_3.9-4.1        zlibbioc_1.2.0
>> >
>>
>> > TCGA_CRC
>> GRanges with 239886 ranges and 1 elementMetadata col:
>>            seqnames                 ranges strand   |                        name
>> <Rle> <IRanges> <Rle>   | <character>
>>        [1]    chr10 [100003848, 100004653]      +   | chr10:100003848-100004653:+
>>        [2]    chr10 [100007443, 100008748]      -   | chr10:100008748-100007443:-
>>        [3]    chr10 [100010822, 100010933]      -   | chr10:100010933-100010822:-
>>        [4]    chr10 [100011323, 100011459]      -   | chr10:100011459-100011323:-
>>        [5]    chr10 [100012110, 100012225]      -   | chr10:100012225-100012110:-
>>        [6]    chr10 [100013310, 100013553]      -   | chr10:100013553-100013310:-
>>        [7]    chr10 [100015334, 100015496]      -   | chr10:100015496-100015334:-
>>        [8]    chr10 [100016537, 100016704]      -   | chr10:100016704-100016537:-
>>        [9]    chr10 [100017407, 100017561]      -   | chr10:100017561-100017407:-
>>        ...      ...                    ...    ... ...                         ...
>>   [239878]     chrY     [9611654, 9611928]      -   |      chrY:9611928-9611654:-
>>   [239879]     chrY     [9638762, 9638916]      +   |      chrY:9638762-9638916:+
>>   [239880]     chrY     [9642383, 9642494]      +   |      chrY:9642383-9642494:+
>>   [239881]     chrY     [9646920, 9646994]      +   |      chrY:9646920-9646994:+
>>   [239882]     chrY     [9647680, 9647718]      +   |      chrY:9647680-9647718:+
>>   [239883]     chrY     [9650809, 9650854]      +   |      chrY:9650809-9650854:+
>>   [239884]     chrY     [9748407, 9748463]      +   |      chrY:9748407-9748463:+
>>   [239885]     chrY     [9748577, 9748722]      +   |      chrY:9748577-9748722:+
>>   [239886]     chrY     [9749263, 9749571]      +   |      chrY:9749263-9749571:+
>>   ---
>>   seqlengths:
>>     chr1 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19  chr2 chr20 chr21 chr22  chr3  chr4  chr5  chr6  chr7  chr8  chr9  chrM  chrX  chrY
>>       NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
>> >
>> > ex
>> GRanges with 286852 ranges and 3 elementMetadata cols:
>>            seqnames               ranges strand   |   exon_id                   gene_id                   tx_id
>> <Rle> <IRanges> <Rle>   | <integer> <CompressedCharacterList> <CompressedIntegerList>
>>        [1]     chr1     [ 11874,  12227]      +   |         1                        NA                   1,2,3
>>        [2]     chr1     [ 12595,  12721]      +   |         5                        NA                       3
>>        [3]     chr1     [ 12613,  12721]      +   |         2                        NA                       1
>>        [4]     chr1     [ 12646,  12697]      +   |         4                        NA                       2
>>        [5]     chr1     [ 13221,  14409]      +   |         3                        NA                     1,2
>>        [6]     chr1     [ 13403,  14409]      +   |         6                        NA                       3
>>        [7]     chr1     [ 69091,  70008]      +   |        37                     79501                      25
>>        [8]     chr1     [321084, 321115]      +   |        44                        NA                      28
>>        [9]     chr1     [321146, 321207]      +   |        45                        NA                      29
>>        ...      ...                  ...    ... ...       ...                       ...                     ...
>>   [286844]     chrY [27603458, 27603499]      -   |    142741                        NA                   40070
>>   [286845]     chrY [27604352, 27604379]      -   |    142742                        NA                   40071
>>   [286846]     chrY [27605645, 27605678]      -   |    142743                        NA                   40072
>>   [286847]     chrY [27606394, 27606421]      -   |    142744                        NA                   40073
>>   [286848]     chrY [27607404, 27607432]      -   |    142745                        NA                   40074
>>   [286849]     chrY [27635919, 27635954]      -   |    142750                        NA                   40078
>>   [286850]     chrY [59358329, 59359508]      -   |    142802                        NA                   40099
>>   [286851]     chrY [59360007, 59360115]      -   |    142803                        NA                   40099
>>   [286852]     chrY [59360501, 59360854]      -   |    142804                        NA                   40099
>>   ---
>>   seqlengths:
>>                     chr1                  chr2                  chr3 ...                  chrM        chrUn_gl000226 chr18_gl000207_random
>>                249250621             243199373             198022430 ...                 16571                 15008                  4262
>> >
>>
>> > Date: Mon, 10 Sep 2012 17:29:24 -0400
>> > From: jmacdon at uw.edu
>> > To: ying_chen at live.com
>> > CC: bioconductor at r-project.org
>> > Subject: Re: [BioC] How to retrieve exon ID and gene ID from exon coordinates?
>> >
>> > Hi Ying,
>> >
>> > On 9/10/2012 4:54 PM, ying chen wrote:
>> > >
>> > >
>> > > Hi guys, I have a RNASeq data table which has exon cooridinates (chrom, start. end) and raw count. I want to use DEXseq to see differential transcripts. To do it I need to get geneIDs and exonIDs from corresponding exon cooridinates. Any suggestion how to do it? Thanks a lot for the help!
>> >
>> > You don't give much to go on. Assuming you are working with a common
>> > species, it is simple. Let's assume you are working with mice.
>> >
>> > Something like this should work:
>> >
>> > yourdata <- read.table("yourdata.txt", stringsAsFactors=FALSE)
>> > library(TxDb.Mmusculus.UCSC.mm9.knownGene)
>> > ex <- exons(TxDb.Mmusculus.UCSC.mm9.knownGene, columns =
>> > c("exon_id","gene_id"))
>> > yourdata <- GRanges(yourdata$chrom, IRanges(start=yourdata$start,
>> > end=yourdata$end))
>> > elementMetadata(yourdata) <- elementMetadata(ex)[match(yourdata, ex),]
>> >
>> > If you are planning on doing this sort of stuff, do yourself a favor and
>> > read the GenomicFeatures and GenomicRanges vignettes. They are chock
>> > full of info that you will need.
>> >
>> > Best,
>> >
>> > Jim
>> >
>> >
>> >
>> > > Ying
>> > > [[alternative HTML version deleted]]
>> > >
>> > > _______________________________________________
>> > > Bioconductor mailing list
>> > > Bioconductor at r-project.org
>> > > https://stat.ethz.ch/mailman/listinfo/bioconductor
>> > > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>> >
>> > --
>> > James W. MacDonald, M.S.
>> > Biostatistician
>> > University of Washington
>> > Environmental and Occupational Health Sciences
>> > 4225 Roosevelt Way NE, # 100
>> > Seattle WA 98105-6099
>> >
>
>
> --
> James W. MacDonald, M.S.
> Biostatistician
> University of Washington
> Environmental and Occupational Health Sciences
> 4225 Roosevelt Way NE, # 100
> Seattle WA 98105-6099
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

--
A model is a lie that helps you see the truth.

Howard Skipper