[BioC] GENEID is missing when LOCATION is non-intergenic in VariantAnnotation package
Adaikalavan Ramasamy
adaikalavan.ramasamy at gmail.com
Tue Feb 19 19:46:56 CET 2013
Dear Valerie,
Thanks once again for the quick reply. I understand there the
different the annotation databases are not consistent and I am trying
to understand the reason for some of the noise. If I look up
chr1:172410869-172411762 (i.e. res[9, ]) on ucsc and also ensembl, I
see that it overlaps with C1orf105 and PIGC
http://genome.ucsc.edu/cgi-bin/hgTracks?position=chr1:172410869-172411762&hgsid=326905233&pubs=pack
http://www.ensembl.org/Homo_sapiens/Location/View?r=1%3A172410869-172411762
so why does the GeneID is blank in this case? On the other hand res[8,
] calls it PIGC only. Is it because C1orf105 is not a "known gene"?
BTW, the examples I sent were all non-intergenics. Thank you.
Regards, Adai
On Tue, Feb 19, 2013 at 5:21 PM, Valerie Obenchain <vobencha at fhcrc.org> wrote:
> Hello,
>
> All values in the output (GENEID, TXID, etc.) are taken from the annotation
> you are using. If the annotaion does not have a GENEID, TXID, etc. for a
> particular range, then none will be reported.
>
> To take a closer look at the annotation we can extract the transcripts and
> ask for gene_id, cds_id and tx_id as columns.
>
> txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
> tx <- transcripts(txdb, columns=c("tx_id", "gene_id", "cds_id"))
>
>
> Looking at the first 3 rows, we see none of these have a gene_id,
>
>>> tx[1:3]
>>
>> GRanges with 3 ranges and 3 metadata columns:
>> seqnames ranges strand | tx_id gene_id
>> <Rle> <IRanges> <Rle> | <integer> <CompressedCharacterList>
>> [1] chr1 [11874, 14409] + | 1
>> [2] chr1 [11874, 14409] + | 2
>> [3] chr1 [11874, 14409] + | 3
>> cds_id
>> <CompressedIntegerList>
>> [1] NA
>> [2] 1,2,3
>> [3] NA
>
>
> To isolate the portion of the annotation that overlaps with your ranges you
> can use subsetByOverlaps(),
>
> gr <- GRanges(c("chr1", "chr1", "chr2", "chr17", "chrX"),
> IRanges(c(23803138, 172410967, 60890373,
> 44061025, 153627145), width=1))
> res <- subsetByOverlaps(tx, gr)
>
> Here is an example of a transcript with no gene_id
>
>>> res[8:10]
>>
>> GRanges with 3 ranges and 3 metadata columns:
>> seqnames ranges strand | tx_id
>> <Rle> <IRanges> <Rle> | <integer>
>> [1] chr1 [172410597, 172413230] - | 6937
>> [2] chr1 [172410869, 172411762] - | 6938
>> [3] chr17 [ 43971748, 44105699] + | 59914
>> gene_id cds_id
>> <CompressedCharacterList> <CompressedIntegerList>
>> [1] 5279 NA,20697
>> [2] 20697
>> [3] 4137 NA,178155,178156,...
>
>
>
> Also remember that if a range is 'intergenic' that it will not have a
> GENEID. It will have a PRECEDEID and FOLLOWID, but no GENEID.
>
> Valerie
>
>
>
>
> On 02/19/2013 06:41 AM, Adaikalavan Ramasamy wrote:
>>
>> Dear all,
>>
>> I am finding some unexpected results (to me anyway) with the
>> VariantAnnotation package. Basically, there are situations where the
>> GENEID is missing when LOCATION is either coding, promoter, intron,
>> threeUTR or fiveUTR. Here is an example with five SNPs (among many
>> more). I have marked the unexpected results with "##".
>>
>>
>> library(VariantAnnotation); library(TxDb.Hsapiens.UCSC.hg19.knownGene)
>>
>> tmp <- rbind.data.frame(c("rs10917388", "chr1", 23803138),
>> c("rs1063412", "chr1", 172410967),
>> c("rs78291220", "chr2", 60890373),
>> c("rs116917239", "chr17", 44061025),
>> c("rs11593", "chrX", 153627145) )
>> colnames(tmp) <- c("rsid", "chr", "pos")
>> tmp$pos <- as.numeric( as.character(tmp$pos) )
>>
>> target <- with(tmp, GRanges(seqnames = Rle(chr),
>> ranges = IRanges(pos,
>> end=pos, names=rsid),
>> strand = Rle(strand("*")) ) )
>>
>> loc <- locateVariants(target, TxDb.Hsapiens.UCSC.hg19.knownGene,
>> AllVariants())
>> names(loc) <- NULL
>> out <- as.data.frame(loc)
>> out$rsid <- names(target)[ out$QUERYID ]
>> out <- out[ , c("rsid", "seqnames", "start", "LOCATION", "GENEID",
>> "PRECEDEID", "FOLLOWID")]
>> out <- unique(out)
>> rownames(out) <- NULL
>> out
>>
>> rsid seqnames start LOCATION GENEID PRECEDEID FOLLOWID
>> 1 rs10917388 chr1 23803138 intron 55616 <NA> <NA>
>> 2 rs10917388 chr1 23803138 promoter <NA> <NA> <NA> ##
>>
>> 3 rs1063412 chr1 172410967 intron 92346 <NA> <NA>
>> 4 rs1063412 chr1 172410967 intron 5279 <NA> <NA>
>> 5 rs1063412 chr1 172410967 coding 5279 <NA> <NA>
>> 6 rs1063412 chr1 172410967 coding <NA> <NA> <NA> ##
>>
>> 7 rs78291220 chr2 60890373 promoter <NA> <NA> <NA> ##
>> 8 rs78291220 chr2 60890373 intergenic <NA> 64895 400957
>>
>> 9 rs116917239 chr17 44061025 coding 4137 <NA> <NA>
>> 10 rs116917239 chr17 44061025 intron 4137 <NA> <NA>
>> 11 rs116917239 chr17 44061025 coding <NA> <NA> <NA> ##
>>
>> 12 rs11593 chrX 153627145 intron 6134 <NA> <NA>
>> 13 rs11593 chrX 153627145 promoter 6134 <NA> <NA>
>> 14 rs11593 chrX 153627145 promoter 26778 <NA> <NA>
>> 15 rs11593 chrX 153627145 promoter <NA> <NA> <NA> ##
>> 16 rs11593 chrX 153627145 fiveUTR <NA> <NA> <NA> ##
>> 17 rs11593 chrX 153627145 threeUTR <NA> <NA> <NA> ##
>>
>> Can anyone help explain what is happening please? Is this to be
>> expected? Thank you.
>>
>> Regards, Adai
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>
More information about the Bioconductor
mailing list