[BioC] VCF predictCoding(...) providing inconsistent results?
Murat Tasan
mmuurr at gmail.com
Tue Jan 15 22:34:19 CET 2013
hi all - i found a strange case while examining a VCF object extracted
from chromosome 1 of the 1000 Genomes Project "integrated_call_set"
files.
i first extracted all variants falling in CDS regions of some of my
genes of interest into a vcf object 'vcf'.
then i ran predictCoding(vcf, TXDB, Hsapiens), where TXDB and Hsapiens
come from TxDb.Hsapiens.UCSC.hg19.knownGene and
BSgenome.Hsapiens.UCSC.hg19, respectively:
vcf_coding_info_GRanges <- predictCoding(vcf, TXDB, Hsapiens).
now, examining a single variant i got pretty confused, from two cases:
CASE 1
> vcf_coding_info_GRanges[names(vcf_coding_info_GRanges) == "rs78192809",]
GRanges with 6 ranges and 13 metadata columns:
seqnames ranges strand | paramRangeID
varAllele CDSLOC PROTEINLOC QUERYID
TXID CDSID GENEID CONSEQUENCE
<Rle> <IRanges> <Rle> | <factor>
<DNAStringSet> <IRanges> <CompressedIntegerList> <integer>
<character> <integer> <character> <factor>
rs78192809 chr1 [177901629, 177901629] - | entrezgene:89866
T [1564, 1564] 522 9426
6991 20852 89866 nonsense
rs78192809 chr1 [177901629, 177901629] - | entrezgene:89866
T [1988, 1988] 663 9426
6992 20852 89866 nonsynonymous
rs78192809 chr1 [177901629, 177901629] - | entrezgene:89866
T [3008, 3008] 1003 9426
6993 20852 89866 nonsynonymous
rs78192809 chr1 [177901629, 177901629] - | entrezgene:89866
T [1985, 1985] 662 9426
6994 20852 89866 nonsynonymous
rs78192809 chr1 [177901629, 177901629] - | entrezgene:89866
T [3011, 3011] 1004 9426
6995 20852 89866 synonymous
rs78192809 chr1 [177901629, 177901629] - | entrezgene:89866
T [2039, 2039] 680 9426
6996 20852 89866 synonymous
REFCODON VARCODON REFAA VARAA
<DNAStringSet> <DNAStringSet> <AAStringSet> <AAStringSet>
rs78192809 CAG TAG Q *
rs78192809 CCA CTA P L
rs78192809 CCA CTA P L
rs78192809 CCA CTA P L
rs78192809 CCA CCA P P
rs78192809 CCA CCA P P
---
seqlengths:
chr1
NA
CASE 2: i extracted the subset of the original vcf object and ran
predictCoding(...) on that sub-object:
> predictCoding(vcf["rs78192809",], TXDB, Hsapiens)
predictCoding(vcf["rs78192809",], TXDB, Hsapiens)
GRanges with 6 ranges and 13 metadata columns:
seqnames ranges strand | paramRangeID
varAllele CDSLOC PROTEINLOC QUERYID
TXID CDSID GENEID CONSEQUENCE
<Rle> <IRanges> <Rle> | <factor>
<DNAStringSet> <IRanges> <CompressedIntegerList> <integer>
<character> <integer> <character> <factor>
rs78192809 chr1 [177901629, 177901629] - | entrezgene:89866
T [1564, 1564] 522 1
6991 20852 89866 nonsense
rs78192809 chr1 [177901629, 177901629] - | entrezgene:89866
T [1988, 1988] 663 1
6992 20852 89866 nonsynonymous
rs78192809 chr1 [177901629, 177901629] - | entrezgene:89866
T [3008, 3008] 1003 1
6993 20852 89866 nonsynonymous
rs78192809 chr1 [177901629, 177901629] - | entrezgene:89866
T [1985, 1985] 662 1
6994 20852 89866 nonsynonymous
rs78192809 chr1 [177901629, 177901629] - | entrezgene:89866
T [3011, 3011] 1004 1
6995 20852 89866 nonsynonymous
rs78192809 chr1 [177901629, 177901629] - | entrezgene:89866
T [2039, 2039] 680 1
6996 20852 89866 nonsynonymous
REFCODON VARCODON REFAA VARAA
<DNAStringSet> <DNAStringSet> <AAStringSet> <AAStringSet>
rs78192809 CAG TAG Q *
rs78192809 CCA CTA P L
rs78192809 CCA CTA P L
rs78192809 CCA CTA P L
rs78192809 CCA CTA P L
rs78192809 CCA CTA P L
---
seqlengths:
chr1
NA
for nearly all fields, especially the CDSID, TXID, and the location
data of the variant in the sequences, the results are the same.
BUT, the CONSEQUENCE, VARCODON, and VARAA are inconsistent between the
two calls.
have i done something wrong in the second attempt (i.e. does
subsetting on the VCF variants first cause a problem when submitting
the result to predictCoding(...))?
thanks for any help/insight.
-m
More information about the Bioconductor
mailing list