[BioC] VCF predictCoding(...) providing inconsistent results?
Murat Tasan
mmuurr at gmail.com
Tue Jan 15 22:57:13 CET 2013
i should add that i actually had to pre-process the vcf file a bit to
remedy two problems:
(1) i changed the seqlevels like so:
seqlevels(vcf) <- sprintf("chr%s", seqlevels(vcf))
and (2) i had to remove structural variants, but i think i did this
without corrupting the resulting vcf structure... here's that bit of
code:
structural_lix <- rep(FALSE, length(alt(vcf)))
structural_lix[sapply(alt(vcf), function(x) any(grepl("<", x)))] <- TRUE
table(structural_lix) ## just to see how many entries will be
removed.
vcf <- vcf[!structural_lix,]
alt(vcf) <- VariantAnnotation:::.toDNAStringSetList(unlist(alt(vcf),
use.names = FALSE))
(the structural-variant removal code was inspired and closely follows
that described here:
https://stat.ethz.ch/pipermail/bioconductor/2012-October/048448.html)
after these two steps, the vcf object seemed to be perfectly happy,
until the predictCoding(...) strangeness described in the previous
post.
cheers,
-m
On Tue, Jan 15, 2013 at 4:34 PM, Murat Tasan <mmuurr at gmail.com> wrote:
> hi all - i found a strange case while examining a VCF object extracted
> from chromosome 1 of the 1000 Genomes Project "integrated_call_set"
> files.
> i first extracted all variants falling in CDS regions of some of my
> genes of interest into a vcf object 'vcf'.
>
> then i ran predictCoding(vcf, TXDB, Hsapiens), where TXDB and Hsapiens
> come from TxDb.Hsapiens.UCSC.hg19.knownGene and
> BSgenome.Hsapiens.UCSC.hg19, respectively:
>
> vcf_coding_info_GRanges <- predictCoding(vcf, TXDB, Hsapiens).
>
> now, examining a single variant i got pretty confused, from two cases:
>
> CASE 1
>
>> vcf_coding_info_GRanges[names(vcf_coding_info_GRanges) == "rs78192809",]
> GRanges with 6 ranges and 13 metadata columns:
> seqnames ranges strand | paramRangeID
> varAllele CDSLOC PROTEINLOC QUERYID
> TXID CDSID GENEID CONSEQUENCE
> <Rle> <IRanges> <Rle> | <factor>
> <DNAStringSet> <IRanges> <CompressedIntegerList> <integer>
> <character> <integer> <character> <factor>
> rs78192809 chr1 [177901629, 177901629] - | entrezgene:89866
> T [1564, 1564] 522 9426
> 6991 20852 89866 nonsense
> rs78192809 chr1 [177901629, 177901629] - | entrezgene:89866
> T [1988, 1988] 663 9426
> 6992 20852 89866 nonsynonymous
> rs78192809 chr1 [177901629, 177901629] - | entrezgene:89866
> T [3008, 3008] 1003 9426
> 6993 20852 89866 nonsynonymous
> rs78192809 chr1 [177901629, 177901629] - | entrezgene:89866
> T [1985, 1985] 662 9426
> 6994 20852 89866 nonsynonymous
> rs78192809 chr1 [177901629, 177901629] - | entrezgene:89866
> T [3011, 3011] 1004 9426
> 6995 20852 89866 synonymous
> rs78192809 chr1 [177901629, 177901629] - | entrezgene:89866
> T [2039, 2039] 680 9426
> 6996 20852 89866 synonymous
> REFCODON VARCODON REFAA VARAA
> <DNAStringSet> <DNAStringSet> <AAStringSet> <AAStringSet>
> rs78192809 CAG TAG Q *
> rs78192809 CCA CTA P L
> rs78192809 CCA CTA P L
> rs78192809 CCA CTA P L
> rs78192809 CCA CCA P P
> rs78192809 CCA CCA P P
> ---
> seqlengths:
> chr1
> NA
>
> CASE 2: i extracted the subset of the original vcf object and ran
> predictCoding(...) on that sub-object:
>
>> predictCoding(vcf["rs78192809",], TXDB, Hsapiens)
> predictCoding(vcf["rs78192809",], TXDB, Hsapiens)
> GRanges with 6 ranges and 13 metadata columns:
> seqnames ranges strand | paramRangeID
> varAllele CDSLOC PROTEINLOC QUERYID
> TXID CDSID GENEID CONSEQUENCE
> <Rle> <IRanges> <Rle> | <factor>
> <DNAStringSet> <IRanges> <CompressedIntegerList> <integer>
> <character> <integer> <character> <factor>
> rs78192809 chr1 [177901629, 177901629] - | entrezgene:89866
> T [1564, 1564] 522 1
> 6991 20852 89866 nonsense
> rs78192809 chr1 [177901629, 177901629] - | entrezgene:89866
> T [1988, 1988] 663 1
> 6992 20852 89866 nonsynonymous
> rs78192809 chr1 [177901629, 177901629] - | entrezgene:89866
> T [3008, 3008] 1003 1
> 6993 20852 89866 nonsynonymous
> rs78192809 chr1 [177901629, 177901629] - | entrezgene:89866
> T [1985, 1985] 662 1
> 6994 20852 89866 nonsynonymous
> rs78192809 chr1 [177901629, 177901629] - | entrezgene:89866
> T [3011, 3011] 1004 1
> 6995 20852 89866 nonsynonymous
> rs78192809 chr1 [177901629, 177901629] - | entrezgene:89866
> T [2039, 2039] 680 1
> 6996 20852 89866 nonsynonymous
> REFCODON VARCODON REFAA VARAA
> <DNAStringSet> <DNAStringSet> <AAStringSet> <AAStringSet>
> rs78192809 CAG TAG Q *
> rs78192809 CCA CTA P L
> rs78192809 CCA CTA P L
> rs78192809 CCA CTA P L
> rs78192809 CCA CTA P L
> rs78192809 CCA CTA P L
> ---
> seqlengths:
> chr1
> NA
>
> for nearly all fields, especially the CDSID, TXID, and the location
> data of the variant in the sequences, the results are the same.
> BUT, the CONSEQUENCE, VARCODON, and VARAA are inconsistent between the
> two calls.
>
> have i done something wrong in the second attempt (i.e. does
> subsetting on the VCF variants first cause a problem when submitting
> the result to predictCoding(...))?
>
> thanks for any help/insight.
>
> -m
More information about the Bioconductor
mailing list