[BioC] VCF predictCoding(...) providing inconsistent results?

Tue Jan 15 22:57:13 CET 2013

i should add that i actually had to pre-process the vcf file a bit to
remedy two problems:

(1) i changed the seqlevels like so:

  seqlevels(vcf) <- sprintf("chr%s", seqlevels(vcf))

and (2) i had to remove structural variants, but i think i did this
without corrupting the resulting vcf structure... here's that bit of
code:

  structural_lix <- rep(FALSE, length(alt(vcf)))
  structural_lix[sapply(alt(vcf), function(x) any(grepl("<", x)))] <- TRUE
  table(structural_lix) ## just to see how many entries will be
removed.

  vcf <- vcf[!structural_lix,]
  alt(vcf) <- VariantAnnotation:::.toDNAStringSetList(unlist(alt(vcf),
use.names = FALSE))

(the structural-variant removal code was inspired and closely follows
that described here:
https://stat.ethz.ch/pipermail/bioconductor/2012-October/048448.html)

after these two steps, the vcf object seemed to be perfectly happy,
until the predictCoding(...) strangeness described in the previous
post.

cheers,

-m

On Tue, Jan 15, 2013 at 4:34 PM, Murat Tasan <mmuurr at gmail.com> wrote:
> hi all - i found a strange case while examining a VCF object extracted
> from chromosome 1 of the 1000 Genomes Project "integrated_call_set"
> files.
> i first extracted all variants falling in CDS regions of some of my
> genes of interest into a vcf object 'vcf'.
>
> then i ran predictCoding(vcf, TXDB, Hsapiens), where TXDB and Hsapiens
> come from TxDb.Hsapiens.UCSC.hg19.knownGene and
> BSgenome.Hsapiens.UCSC.hg19, respectively:
>
> vcf_coding_info_GRanges <- predictCoding(vcf, TXDB, Hsapiens).
>
> now, examining a single variant i got pretty confused, from two cases:
>
> CASE 1
>
>> vcf_coding_info_GRanges[names(vcf_coding_info_GRanges) == "rs78192809",]
> GRanges with 6 ranges and 13 metadata columns:
>              seqnames                 ranges strand |     paramRangeID
>      varAllele       CDSLOC              PROTEINLOC   QUERYID
> TXID     CDSID      GENEID   CONSEQUENCE
>                 <Rle>              <IRanges>  <Rle> |         <factor>
> <DNAStringSet>    <IRanges> <CompressedIntegerList> <integer>
> <character> <integer> <character>      <factor>
>   rs78192809     chr1 [177901629, 177901629]      - | entrezgene:89866
>              T [1564, 1564]                     522      9426
> 6991     20852       89866      nonsense
>   rs78192809     chr1 [177901629, 177901629]      - | entrezgene:89866
>              T [1988, 1988]                     663      9426
> 6992     20852       89866 nonsynonymous
>   rs78192809     chr1 [177901629, 177901629]      - | entrezgene:89866
>              T [3008, 3008]                    1003      9426
> 6993     20852       89866 nonsynonymous
>   rs78192809     chr1 [177901629, 177901629]      - | entrezgene:89866
>              T [1985, 1985]                     662      9426
> 6994     20852       89866 nonsynonymous
>   rs78192809     chr1 [177901629, 177901629]      - | entrezgene:89866
>              T [3011, 3011]                    1004      9426
> 6995     20852       89866    synonymous
>   rs78192809     chr1 [177901629, 177901629]      - | entrezgene:89866
>              T [2039, 2039]                     680      9426
> 6996     20852       89866    synonymous
>                    REFCODON       VARCODON         REFAA         VARAA
>              <DNAStringSet> <DNAStringSet> <AAStringSet> <AAStringSet>
>   rs78192809            CAG            TAG             Q             *
>   rs78192809            CCA            CTA             P             L
>   rs78192809            CCA            CTA             P             L
>   rs78192809            CCA            CTA             P             L
>   rs78192809            CCA            CCA             P             P
>   rs78192809            CCA            CCA             P             P
>   ---
>   seqlengths:
>    chr1
>      NA
>
> CASE 2: i extracted the subset of the original vcf object and ran
> predictCoding(...) on that sub-object:
>
>> predictCoding(vcf["rs78192809",], TXDB, Hsapiens)
> predictCoding(vcf["rs78192809",], TXDB, Hsapiens)
> GRanges with 6 ranges and 13 metadata columns:
>              seqnames                 ranges strand |     paramRangeID
>      varAllele       CDSLOC              PROTEINLOC   QUERYID
> TXID     CDSID      GENEID   CONSEQUENCE
>                 <Rle>              <IRanges>  <Rle> |         <factor>
> <DNAStringSet>    <IRanges> <CompressedIntegerList> <integer>
> <character> <integer> <character>      <factor>
>   rs78192809     chr1 [177901629, 177901629]      - | entrezgene:89866
>              T [1564, 1564]                     522         1
> 6991     20852       89866      nonsense
>   rs78192809     chr1 [177901629, 177901629]      - | entrezgene:89866
>              T [1988, 1988]                     663         1
> 6992     20852       89866 nonsynonymous
>   rs78192809     chr1 [177901629, 177901629]      - | entrezgene:89866
>              T [3008, 3008]                    1003         1
> 6993     20852       89866 nonsynonymous
>   rs78192809     chr1 [177901629, 177901629]      - | entrezgene:89866
>              T [1985, 1985]                     662         1
> 6994     20852       89866 nonsynonymous
>   rs78192809     chr1 [177901629, 177901629]      - | entrezgene:89866
>              T [3011, 3011]                    1004         1
> 6995     20852       89866 nonsynonymous
>   rs78192809     chr1 [177901629, 177901629]      - | entrezgene:89866
>              T [2039, 2039]                     680         1
> 6996     20852       89866 nonsynonymous
>                    REFCODON       VARCODON         REFAA         VARAA
>              <DNAStringSet> <DNAStringSet> <AAStringSet> <AAStringSet>
>   rs78192809            CAG            TAG             Q             *
>   rs78192809            CCA            CTA             P             L
>   rs78192809            CCA            CTA             P             L
>   rs78192809            CCA            CTA             P             L
>   rs78192809            CCA            CTA             P             L
>   rs78192809            CCA            CTA             P             L
>   ---
>   seqlengths:
>    chr1
>      NA
>
> for nearly all fields, especially the CDSID, TXID, and the location
> data of the variant in the sequences, the results are the same.
> BUT, the CONSEQUENCE, VARCODON, and VARAA are inconsistent between the
> two calls.
>
> have i done something wrong in the second attempt (i.e. does
> subsetting on the VCF variants first cause a problem when submitting
> the result to predictCoding(...))?
>
> thanks for any help/insight.
>
> -m