[BioC] VCF predictCoding(...) providing inconsistent results?

Murat Tasan mmuurr at gmail.com
Tue Jan 15 22:34:19 CET 2013


hi all - i found a strange case while examining a VCF object extracted
from chromosome 1 of the 1000 Genomes Project "integrated_call_set"
files.
i first extracted all variants falling in CDS regions of some of my
genes of interest into a vcf object 'vcf'.

then i ran predictCoding(vcf, TXDB, Hsapiens), where TXDB and Hsapiens
come from TxDb.Hsapiens.UCSC.hg19.knownGene and
BSgenome.Hsapiens.UCSC.hg19, respectively:

vcf_coding_info_GRanges <- predictCoding(vcf, TXDB, Hsapiens)

now, examining a single variant, i got pretty confused by the following two cases:

CASE 1: i pulled the rows for one variant out of the full result:

> vcf_coding_info_GRanges[names(vcf_coding_info_GRanges) == "rs78192809",]
GRanges with 6 ranges and 13 metadata columns:
             seqnames                 ranges strand |     paramRangeID      varAllele       CDSLOC              PROTEINLOC   QUERYID        TXID     CDSID      GENEID   CONSEQUENCE
                <Rle>              <IRanges>  <Rle> |         <factor> <DNAStringSet>    <IRanges> <CompressedIntegerList> <integer> <character> <integer> <character>      <factor>
  rs78192809     chr1 [177901629, 177901629]      - | entrezgene:89866              T [1564, 1564]                     522      9426        6991     20852       89866      nonsense
  rs78192809     chr1 [177901629, 177901629]      - | entrezgene:89866              T [1988, 1988]                     663      9426        6992     20852       89866 nonsynonymous
  rs78192809     chr1 [177901629, 177901629]      - | entrezgene:89866              T [3008, 3008]                    1003      9426        6993     20852       89866 nonsynonymous
  rs78192809     chr1 [177901629, 177901629]      - | entrezgene:89866              T [1985, 1985]                     662      9426        6994     20852       89866 nonsynonymous
  rs78192809     chr1 [177901629, 177901629]      - | entrezgene:89866              T [3011, 3011]                    1004      9426        6995     20852       89866    synonymous
  rs78192809     chr1 [177901629, 177901629]      - | entrezgene:89866              T [2039, 2039]                     680      9426        6996     20852       89866    synonymous
                   REFCODON       VARCODON         REFAA         VARAA
             <DNAStringSet> <DNAStringSet> <AAStringSet> <AAStringSet>
  rs78192809            CAG            TAG             Q             *
  rs78192809            CCA            CTA             P             L
  rs78192809            CCA            CTA             P             L
  rs78192809            CCA            CTA             P             L
  rs78192809            CCA            CCA             P             P
  rs78192809            CCA            CCA             P             P
  ---
  seqlengths:
   chr1
     NA

CASE 2: i extracted the subset of the original vcf object and ran
predictCoding(...) on that sub-object:

> predictCoding(vcf["rs78192809",], TXDB, Hsapiens)
GRanges with 6 ranges and 13 metadata columns:
             seqnames                 ranges strand |     paramRangeID      varAllele       CDSLOC              PROTEINLOC   QUERYID        TXID     CDSID      GENEID   CONSEQUENCE
                <Rle>              <IRanges>  <Rle> |         <factor> <DNAStringSet>    <IRanges> <CompressedIntegerList> <integer> <character> <integer> <character>      <factor>
  rs78192809     chr1 [177901629, 177901629]      - | entrezgene:89866              T [1564, 1564]                     522         1        6991     20852       89866      nonsense
  rs78192809     chr1 [177901629, 177901629]      - | entrezgene:89866              T [1988, 1988]                     663         1        6992     20852       89866 nonsynonymous
  rs78192809     chr1 [177901629, 177901629]      - | entrezgene:89866              T [3008, 3008]                    1003         1        6993     20852       89866 nonsynonymous
  rs78192809     chr1 [177901629, 177901629]      - | entrezgene:89866              T [1985, 1985]                     662         1        6994     20852       89866 nonsynonymous
  rs78192809     chr1 [177901629, 177901629]      - | entrezgene:89866              T [3011, 3011]                    1004         1        6995     20852       89866 nonsynonymous
  rs78192809     chr1 [177901629, 177901629]      - | entrezgene:89866              T [2039, 2039]                     680         1        6996     20852       89866 nonsynonymous
                   REFCODON       VARCODON         REFAA         VARAA
             <DNAStringSet> <DNAStringSet> <AAStringSet> <AAStringSet>
  rs78192809            CAG            TAG             Q             *
  rs78192809            CCA            CTA             P             L
  rs78192809            CCA            CTA             P             L
  rs78192809            CCA            CTA             P             L
  rs78192809            CCA            CTA             P             L
  rs78192809            CCA            CTA             P             L
  ---
  seqlengths:
   chr1
     NA

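for nearly all fields, especially the CDSID, TXID, and the location
data of the variant in the sequences, the results are the same
(QUERYID differs too, 9426 vs. 1, but that is expected since it
indexes into the query object).
BUT, the CONSEQUENCE, VARCODON, and VARAA are inconsistent between the
two calls for TXIDs 6995 and 6996: CASE 1 reports CCA -> CCA (P -> P,
synonymous), while CASE 2 reports CCA -> CTA (P -> L, nonsynonymous).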
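
to pin down exactly which rows disagree, here is a minimal base-R sketch that lines the two outputs up by TXID. the values are transcribed by hand from the printouts above, and diff_by_txid is a hypothetical helper written for this post, not part of VariantAnnotation.

```r
# Per-transcript values transcribed by hand from the two printouts above.
txid <- c("6991", "6992", "6993", "6994", "6995", "6996")

case1 <- data.frame(
  TXID        = txid,
  CONSEQUENCE = c("nonsense", "nonsynonymous", "nonsynonymous",
                  "nonsynonymous", "synonymous", "synonymous"),
  VARCODON    = c("TAG", "CTA", "CTA", "CTA", "CCA", "CCA"),
  VARAA       = c("*", "L", "L", "L", "P", "P"),
  stringsAsFactors = FALSE
)

case2 <- data.frame(
  TXID        = txid,
  CONSEQUENCE = c("nonsense", rep("nonsynonymous", 5)),
  VARCODON    = c("TAG", rep("CTA", 5)),
  VARAA       = c("*", rep("L", 5)),
  stringsAsFactors = FALSE
)

# Hypothetical helper: join two result tables on a key column and keep
# only the rows where at least one shared column differs.
diff_by_txid <- function(a, b, key = "TXID") {
  cols <- setdiff(intersect(names(a), names(b)), key)
  m <- merge(a, b, by = key, suffixes = c(".case1", ".case2"))
  differs <- rep(FALSE, nrow(m))
  for (col in cols) {
    differs <- differs |
      (m[[paste0(col, ".case1")]] != m[[paste0(col, ".case2")]])
  }
  m[differs, , drop = FALSE]
}

mismatches <- diff_by_txid(case1, case2)
print(mismatches)
```

the same comparison could presumably be run on data frames built from the metadata columns of each GRanges result (e.g. via mcols(...)), though that conversion is untested here.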

have i done something wrong in the second attempt (i.e. does
subsetting on the VCF variants first cause a problem when submitting
the result to predictCoding(...))?
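
one way to probe which of the two outputs is internally consistent is to recompute the variant codon from CDSLOC, REFCODON, and varAllele by hand. the sketch below is base R only and assumes (this is an assumption, not something the printouts state) that varAllele is given in the transcript's orientation and that the variant is a single-base substitution at CDSLOC. the small codon table covers only the four codons that appear above. it checks the printed rows against each other; it does not explain why the two calls differ.

```r
# Values transcribed from the printouts above (identical in both cases).
txid       <- c("6991", "6992", "6993", "6994", "6995", "6996")
cdsloc     <- c(1564, 1988, 3008, 1985, 3011, 2039)
refcodon   <- c("CAG", "CCA", "CCA", "CCA", "CCA", "CCA")
var_allele <- "T"  # ASSUMPTION: already in transcript orientation

# 1-based position of the variant inside its codon, and the codon index
# within the protein (should reproduce the PROTEINLOC column).
codon_pos   <- (cdsloc - 1) %% 3 + 1
protein_loc <- ceiling(cdsloc / 3)

# Substitute the variant allele into each reference codon.
expected_varcodon <- mapply(function(codon, pos) {
  substr(codon, pos, pos) <- var_allele
  codon
}, refcodon, codon_pos, USE.NAMES = FALSE)

# Minimal lookup covering only the codons seen in this post.
codon_aa <- c(CAG = "Q", TAG = "*", CCA = "P", CTA = "L")
expected_varaa <- unname(codon_aa[expected_varcodon])

# VARCODON as printed in CASE 1, for comparison.
case1_varcodon <- c("TAG", "CTA", "CTA", "CTA", "CCA", "CCA")

check <- data.frame(
  TXID = txid, CDSLOC = cdsloc, codon_pos, protein_loc,
  expected_varcodon, expected_varaa,
  case1_agrees = case1_varcodon == expected_varcodon,
  stringsAsFactors = FALSE
)
print(check)
```

under those assumptions the recomputed PROTEINLOC values match the printed ones, and the expected variant codon is CTA for every CCA row, which is what CASE 2 reports; the CASE 1 rows for TXIDs 6995 and 6996 are the ones that do not fit.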

thanks for any help/insight.

-m


