[BioC] GENEID is missing when LOCATION is non-intergenic in VariantAnnotation package

Tue Feb 19 19:46:56 CET 2013

Dear Valerie,

Thanks once again for the quick reply. I understand there the
different the annotation databases are not consistent and I am trying
to understand the reason for some of the noise. If I look up
chr1:172410869-172411762 (i.e. res[9, ]) on ucsc and also ensembl, I
see that it overlaps with C1orf105 and PIGC

http://genome.ucsc.edu/cgi-bin/hgTracks?position=chr1:172410869-172411762&hgsid=326905233&pubs=pack
http://www.ensembl.org/Homo_sapiens/Location/View?r=1%3A172410869-172411762

so why does the GeneID is blank in this case? On the other hand res[8,
] calls it PIGC only. Is it because C1orf105 is not a "known gene"?
BTW, the examples I sent were all non-intergenics. Thank you.

Regards, Adai

On Tue, Feb 19, 2013 at 5:21 PM, Valerie Obenchain <vobencha at fhcrc.org> wrote:
> Hello,
>
> All values in the output (GENEID, TXID, etc.) are taken from the annotation
> you are using. If the annotaion does not have a GENEID, TXID, etc. for a
> particular range, then none will be reported.
>
> To take a closer look at the annotation we can extract the transcripts and
> ask for gene_id, cds_id and tx_id as columns.
>
> txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
> tx <- transcripts(txdb, columns=c("tx_id", "gene_id", "cds_id"))
>
>
> Looking at the first 3 rows, we see none of these have a gene_id,
>
>>> tx[1:3]
>>
>> GRanges with 3 ranges and 3 metadata columns:
>>       seqnames         ranges strand |     tx_id                   gene_id
>>          <Rle>      <IRanges>  <Rle> | <integer> <CompressedCharacterList>
>>   [1]     chr1 [11874, 14409]      + |         1
>>   [2]     chr1 [11874, 14409]      + |         2
>>   [3]     chr1 [11874, 14409]      + |         3
>>                        cds_id
>>       <CompressedIntegerList>
>>   [1]                      NA
>>   [2]                   1,2,3
>>   [3]                      NA
>
>
> To isolate the portion of the annotation that overlaps with your ranges you
> can use subsetByOverlaps(),
>
> gr <- GRanges(c("chr1", "chr1", "chr2", "chr17", "chrX"),
>               IRanges(c(23803138, 172410967, 60890373,
>                       44061025, 153627145), width=1))
> res <- subsetByOverlaps(tx, gr)
>
> Here is an example of a transcript with no gene_id
>
>>> res[8:10]
>>
>> GRanges with 3 ranges and 3 metadata columns:
>>       seqnames                 ranges strand |     tx_id
>>          <Rle>              <IRanges>  <Rle> | <integer>
>>   [1]     chr1 [172410597, 172413230]      - |      6937
>>   [2]     chr1 [172410869, 172411762]      - |      6938
>>   [3]    chr17 [ 43971748,  44105699]      + |     59914
>>                         gene_id                  cds_id
>>       <CompressedCharacterList> <CompressedIntegerList>
>>   [1]                      5279                NA,20697
>>   [2]                                             20697
>>   [3]                      4137    NA,178155,178156,...
>
>
>
> Also remember that if a range is 'intergenic' that it will not have a
> GENEID. It will have a PRECEDEID and FOLLOWID, but no GENEID.
>
> Valerie
>
>
>
>
> On 02/19/2013 06:41 AM, Adaikalavan Ramasamy wrote:
>>
>> Dear all,
>>
>> I am finding some unexpected results (to me anyway) with the
>> VariantAnnotation package. Basically, there are situations where the
>> GENEID is missing when LOCATION is either coding, promoter, intron,
>> threeUTR or fiveUTR. Here is an example with five SNPs (among many
>> more). I have marked the unexpected results with "##".
>>
>>
>> library(VariantAnnotation); library(TxDb.Hsapiens.UCSC.hg19.knownGene)
>>
>> tmp <- rbind.data.frame(c("rs10917388",  "chr1",  23803138),
>>                                    c("rs1063412",   "chr1", 172410967),
>>                                    c("rs78291220",  "chr2",  60890373),
>>                                    c("rs116917239", "chr17", 44061025),
>>                                    c("rs11593",     "chrX",  153627145) )
>> colnames(tmp) <- c("rsid", "chr", "pos")
>> tmp$pos <- as.numeric( as.character(tmp$pos) )
>>
>> target <- with(tmp, GRanges(seqnames = Rle(chr),
>>                                            ranges   = IRanges(pos,
>> end=pos, names=rsid),
>>                                            strand   = Rle(strand("*")) ) )
>>
>> loc <- locateVariants(target, TxDb.Hsapiens.UCSC.hg19.knownGene,
>> AllVariants())
>> names(loc) <- NULL
>> out <- as.data.frame(loc)
>> out$rsid <- names(target)[ out$QUERYID ]
>> out <- out[ , c("rsid", "seqnames", "start", "LOCATION", "GENEID",
>> "PRECEDEID", "FOLLOWID")]
>> out <- unique(out)
>> rownames(out) <- NULL
>> out
>>
>>             rsid  seqnames     start   LOCATION GENEID PRECEDEID FOLLOWID
>> 1   rs10917388     chr1  23803138     intron  55616      <NA>     <NA>
>> 2   rs10917388     chr1  23803138   promoter   <NA>      <NA>     <NA> ##
>>
>> 3    rs1063412     chr1 172410967     intron  92346      <NA>     <NA>
>> 4    rs1063412     chr1 172410967     intron   5279      <NA>     <NA>
>> 5    rs1063412     chr1 172410967     coding   5279      <NA>     <NA>
>> 6    rs1063412     chr1 172410967     coding   <NA>      <NA>     <NA> ##
>>
>> 7   rs78291220     chr2  60890373   promoter   <NA>      <NA>     <NA> ##
>> 8   rs78291220     chr2  60890373 intergenic   <NA>     64895   400957
>>
>> 9  rs116917239    chr17  44061025     coding   4137      <NA>     <NA>
>> 10 rs116917239    chr17  44061025     intron   4137      <NA>     <NA>
>> 11 rs116917239    chr17  44061025     coding   <NA>      <NA>     <NA> ##
>>
>> 12     rs11593     chrX 153627145     intron   6134      <NA>     <NA>
>> 13     rs11593     chrX 153627145   promoter   6134      <NA>     <NA>
>> 14     rs11593     chrX 153627145   promoter  26778      <NA>     <NA>
>> 15     rs11593     chrX 153627145   promoter   <NA>      <NA>     <NA> ##
>> 16     rs11593     chrX 153627145    fiveUTR    <NA>      <NA>     <NA> ##
>> 17     rs11593     chrX 153627145   threeUTR   <NA>      <NA>     <NA> ##
>>
>> Can anyone help explain what is happening please? Is this to be
>> expected? Thank you.
>>
>> Regards, Adai
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>