[BioC] GenomicRanges : How to retrieve exon ID and gene ID from exon coordinates?
Tim Triche, Jr.
tim.triche at gmail.com
Tue Sep 11 22:34:15 CEST 2012
TCGA BAMs aren't usually available to the public because they contain
personally identifiable information about actual humans (not cell lines).
BED and WIG files are not produced, as they are for (say) ENCODE or
the Roadmap project.
Instead, a file format specific to TCGA is used, and it typically
looks like this, e.g. for exons:
(from https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/coad/cgcc/unc.edu/illuminaga_rnaseq/rnaseq/unc.edu_COAD.IlluminaGA_RNASeq.Level_3.3.4.0/UNCID_278446.TCGA-AA-3860-01A-02R-0905-07.100908_UNC6-RDR300211_00030_FC_62EL4AAXX.1.trimmed.annotated.translated_to_genomic.exon.quantification.txt)
exon raw_counts median_length_normalized RPKM
chr1:11874-12227:+ 0 0 0
...
chrM:14762-15888:+ 9917181 8799.62821650399 4252.73411539598
chrM:15999-16571:+ 287037 500.937172774869 242.095751310737
Since all of the ranges in the first column will be the same for a
given tumor type, those are mapped to the GAF, where their annotations
are kept, e.g. (from
https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/other/GAF/GAF_bundle/outputs/TCGA.Sept2010.09202010.gaf)
#EntryNumber FeatureID FeatureType FeatureDBSource FeatureDBVersion FeatureDBDate FeatureSeqFileName Composite CompositeType CompositeDBSource CompositeDBVersion CompositeDBDate AlignmentType FeatureCoordinates CompositeCoordinates Gene GeneLocus FeatureAliases FeatureInfo
1 CPA1|1357 gene calculated genomic GRCh37 genome NCBI GRCh37 pairwise 1-137,138-219,220-453,454-555,556-657,658-1064,1065-1155,1156-1355,1356-1440,1441-1724 chr7:130020290-130020426,130020939-130021020,130021471-130021704,130021949-130022050,130023232-130023333,130023525-130023931,130024377-130024467,130024987-130025186,130025680-130025764,130027665-130027948:+ CPA1|1357 chr7:130020290-130027948:+
2 GUCY2D|3000 gene calculated genomic GRCh37 genome NCBI GRCh37 pairwise 1-65,66-795,796-1100,1101-1452,1453-1537,1538-1640,1641-1742,1743-1823,1824-2030,2031-2187,2188-2337,2338-2486,2487-2650,2651-2843,2844-3018,3019-3117,3118-3212,3213-3298,3299-3410,3411-3623 chr17:7905988-7906052,7906357-7907086,7907170-7907474,7909681-7910032,7910378-7910462,7910744-7910846,7911249-7911350,7912824-7912904,7915462-7915668,7915768-7915924,7916421-7916570,7917198-7917346,7917919-7918082,7918177-7918369,7918646-7918820,7919061-7919159,7919245-7919339,7919523-7919608,7919761-7919872,7923446-7923658:+
It is a drag to deal with this so I am going to cc: Marc Carlson on
this with a suggestion that we (possibly me, but better someone at UNC
or UBC) upload FDb.* packages with the GAF ranges and their metadata
contained, as annotations for the release cycle. I won't have time
until this weekend to look.
Best,
--t
On Tue, Sep 11, 2012 at 1:24 PM, James W. MacDonald <jmacdon at uw.edu> wrote:
>
> Hi Ying,
>
>
> On 9/11/2012 4:06 PM, ying chen wrote:
>>
>> Hi,
>>
>> I read the vignettes and tried the code Jim suggested, but got an error message I could not figure out why.
>>
>> Here is what I did:
>>
>> > library(GenomicFeatures)
>> > library(GenomicRanges)
>> > library(TxDb.Hsapiens.UCSC.hg19.knownGene)
>> > TCGA_CRC <-
>> read.delim("TCGA_CRC_UNC_RNASeq_ExonID_detail_new.txt",stringsAsFactors=FALSE)
>> > ex <- exons(TxDb.Hsapiens.UCSC.hg19.knownGene, columns =
>> c("exon_id","gene_id","tx_id"))
>> > TCGA_CRC <- GRanges(seqnames=TCGA_CRC$Chrom,
>> ranges=IRanges(start=TCGA_CRC$Start,end=TCGA_CRC$End),strand=TCGA_CRC$Strand,name=TCGA_CRC$ExonID)
>> > elementMetadata(TCGA_CRC) <- elementMetadata(ex)[match(TCGA_CRC,
>> ex),]
>> Error in elementMetadata(ex)[match(TCGA_CRC, ex), ] :
>> selecting rows: subscript contains NAs or out of bounds indices
>>
>
> You will get that error if there are ranges in TCGA_CRC that are not in ex. So the assumption here is that the exons from your RNA-seq experiment are exons that match up with what is in the knownGene table from UCSC.
>
> The fact that you have counts from exons that don't align to the expected ranges from the version of the knownGene table you are using is problematic, and implies that something is amiss. A first check is to ensure that your RNA-seq data (and counts) refer to the hg19 build.
>
> You can find out which ranges are different between the two using the %in% operator:
>
> x <- which(!TCGA_CRC %in% ex)
>
> Which will at least tell you how far off you are. Hypothetically, if there are only a few mismatching ranges, you could subset the TCGA_CRC GRanges object, but personally I would want to know why things aren't agreeing.
>
> Alternatively you could use the HTSeq python scripts to get the counts, in which case you won't need to bother with all of this, or you could use the summarizeOverlaps() function in GenomicRanges to do the counting, which will again ensure that the counts you have align to the exons from the knownGene table.
>
> Best,
>
> Jim
>
>
>
>
>
>> >
>>
>> I cannot tell what is subscript it mentioned.
>>
>> Any suggestion?
>>
>> Thanks a lot fo the help!
>>
>> Ying
>>
>> The following are the some info:
>>
>> > sessionInfo()
>> R version 2.15.1 (2012-06-22)
>> Platform: x86_64-pc-mingw32/x64 (64-bit)
>>
>> locale:
>> [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
>> [4] LC_NUMERIC=C LC_TIME=English_United States.1252
>>
>> attached base packages:
>> [1] stats graphics grDevices utils datasets methods base
>>
>> other attached packages:
>> [1] TxDb.Hsapiens.UCSC.hg19.knownGene_2.7.1 GenomicFeatures_1.8.3 AnnotationDbi_1.18.3
>> [4] Biobase_2.16.0 GenomicRanges_1.8.13 IRanges_1.14.4
>> [7] BiocGenerics_0.2.0
>>
>> loaded via a namespace (and not attached):
>> [1] biomaRt_2.12.0 Biostrings_2.24.1 bitops_1.0-4.1 BSgenome_1.24.0 DBI_0.2-5 RCurl_1.91-1.1 Rsamtools_1.8.6
>> [8] RSQLite_0.11.1 rtracklayer_1.16.3 stats4_2.15.1 tools_2.15.1 XML_3.9-4.1 zlibbioc_1.2.0
>> >
>>
>> > TCGA_CRC
>> GRanges with 239886 ranges and 1 elementMetadata col:
>> seqnames ranges strand | name
>> <Rle> <IRanges> <Rle> | <character>
>> [1] chr10 [100003848, 100004653] + | chr10:100003848-100004653:+
>> [2] chr10 [100007443, 100008748] - | chr10:100008748-100007443:-
>> [3] chr10 [100010822, 100010933] - | chr10:100010933-100010822:-
>> [4] chr10 [100011323, 100011459] - | chr10:100011459-100011323:-
>> [5] chr10 [100012110, 100012225] - | chr10:100012225-100012110:-
>> [6] chr10 [100013310, 100013553] - | chr10:100013553-100013310:-
>> [7] chr10 [100015334, 100015496] - | chr10:100015496-100015334:-
>> [8] chr10 [100016537, 100016704] - | chr10:100016704-100016537:-
>> [9] chr10 [100017407, 100017561] - | chr10:100017561-100017407:-
>> ... ... ... ... ... ...
>> [239878] chrY [9611654, 9611928] - | chrY:9611928-9611654:-
>> [239879] chrY [9638762, 9638916] + | chrY:9638762-9638916:+
>> [239880] chrY [9642383, 9642494] + | chrY:9642383-9642494:+
>> [239881] chrY [9646920, 9646994] + | chrY:9646920-9646994:+
>> [239882] chrY [9647680, 9647718] + | chrY:9647680-9647718:+
>> [239883] chrY [9650809, 9650854] + | chrY:9650809-9650854:+
>> [239884] chrY [9748407, 9748463] + | chrY:9748407-9748463:+
>> [239885] chrY [9748577, 9748722] + | chrY:9748577-9748722:+
>> [239886] chrY [9749263, 9749571] + | chrY:9749263-9749571:+
>> ---
>> seqlengths:
>> chr1 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr2 chr20 chr21 chr22 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chrM chrX chrY
>> NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
>> >
>> > ex
>> GRanges with 286852 ranges and 3 elementMetadata cols:
>> seqnames ranges strand | exon_id gene_id tx_id
>> <Rle> <IRanges> <Rle> | <integer> <CompressedCharacterList> <CompressedIntegerList>
>> [1] chr1 [ 11874, 12227] + | 1 NA 1,2,3
>> [2] chr1 [ 12595, 12721] + | 5 NA 3
>> [3] chr1 [ 12613, 12721] + | 2 NA 1
>> [4] chr1 [ 12646, 12697] + | 4 NA 2
>> [5] chr1 [ 13221, 14409] + | 3 NA 1,2
>> [6] chr1 [ 13403, 14409] + | 6 NA 3
>> [7] chr1 [ 69091, 70008] + | 37 79501 25
>> [8] chr1 [321084, 321115] + | 44 NA 28
>> [9] chr1 [321146, 321207] + | 45 NA 29
>> ... ... ... ... ... ... ... ...
>> [286844] chrY [27603458, 27603499] - | 142741 NA 40070
>> [286845] chrY [27604352, 27604379] - | 142742 NA 40071
>> [286846] chrY [27605645, 27605678] - | 142743 NA 40072
>> [286847] chrY [27606394, 27606421] - | 142744 NA 40073
>> [286848] chrY [27607404, 27607432] - | 142745 NA 40074
>> [286849] chrY [27635919, 27635954] - | 142750 NA 40078
>> [286850] chrY [59358329, 59359508] - | 142802 NA 40099
>> [286851] chrY [59360007, 59360115] - | 142803 NA 40099
>> [286852] chrY [59360501, 59360854] - | 142804 NA 40099
>> ---
>> seqlengths:
>> chr1 chr2 chr3 ... chrM chrUn_gl000226 chr18_gl000207_random
>> 249250621 243199373 198022430 ... 16571 15008 4262
>> >
>>
>> > Date: Mon, 10 Sep 2012 17:29:24 -0400
>> > From: jmacdon at uw.edu
>> > To: ying_chen at live.com
>> > CC: bioconductor at r-project.org
>> > Subject: Re: [BioC] How to retrieve exon ID and gene ID from exon coordinates?
>> >
>> > Hi Ying,
>> >
>> > On 9/10/2012 4:54 PM, ying chen wrote:
>> > >
>> > >
>> > > Hi guys, I have a RNASeq data table which has exon cooridinates (chrom, start. end) and raw count. I want to use DEXseq to see differential transcripts. To do it I need to get geneIDs and exonIDs from corresponding exon cooridinates. Any suggestion how to do it? Thanks a lot for the help!
>> >
>> > You don't give much to go on. Assuming you are working with a common
>> > species, it is simple. Let's assume you are working with mice.
>> >
>> > Something like this should work:
>> >
>> > yourdata <- read.table("yourdata.txt", stringsAsFactors=FALSE)
>> > library(TxDb.Mmusculus.UCSC.mm9.knownGene)
>> > ex <- exons(TxDb.Mmusculus.UCSC.mm9.knownGene, columns =
>> > c("exon_id","gene_id"))
>> > yourdata <- GRanges(yourdata$chrom, IRanges(start=yourdata$start,
>> > end=yourdata$end))
>> > elementMetadata(yourdata) <- elementMetadata(ex)[match(yourdata, ex),]
>> >
>> > If you are planning on doing this sort of stuff, do yourself a favor and
>> > read the GenomicFeatures and GenomicRanges vignettes. They are chock
>> > full of info that you will need.
>> >
>> > Best,
>> >
>> > Jim
>> >
>> >
>> >
>> > > Ying
>> > > [[alternative HTML version deleted]]
>> > >
>> > > _______________________________________________
>> > > Bioconductor mailing list
>> > > Bioconductor at r-project.org
>> > > https://stat.ethz.ch/mailman/listinfo/bioconductor
>> > > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>> >
>> > --
>> > James W. MacDonald, M.S.
>> > Biostatistician
>> > University of Washington
>> > Environmental and Occupational Health Sciences
>> > 4225 Roosevelt Way NE, # 100
>> > Seattle WA 98105-6099
>> >
>
>
> --
> James W. MacDonald, M.S.
> Biostatistician
> University of Washington
> Environmental and Occupational Health Sciences
> 4225 Roosevelt Way NE, # 100
> Seattle WA 98105-6099
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
--
A model is a lie that helps you see the truth.
Howard Skipper
More information about the Bioconductor
mailing list