[BioC] mutant allele read counts

Sun Jun 15 05:40:12 CEST 2014

On 06/14/14 20:16, Valerie Obenchain wrote:
> Thanks for providing the test files. The subset bug you hit was fixed in
> devel a few weeks ago. I've ported it to release as 1.10.2 which should
> be available via biocLite() Monday noon-ish or from svn immediately.
>
> With 1.10.2 I can read the snv file fine:
>
> vr <-
> readVcfAsVRanges("3_Middle_lobe_tumor--2_mucosal_normal.snv.mutect.v1.1.4.annotated.vcf",
> "")
>
>>> vr[1:3,]
>> VRanges with 3 ranges and 0 metadata columns:
>>       seqnames           ranges strand         ref              alt
>>          <Rle>        <IRanges>  <Rle> <character> <characterOrRle>
>>   [1]        1 [109641, 109641]      +           A                G
>>   [2]        1 [526561, 526561]      +           T                G
>>   [3]        1 [691958, 691958]      +           G                A
>>           totalDepth       refDepth       altDepth         sampleNames
>>       <integerOrRle> <integerOrRle> <integerOrRle>       <factorOrRle>
>>   [1]           <NA>             31              5 3_Middle_lobe_tumor
>>   [2]           <NA>             56              5 3_Middle_lobe_tumor
>>   [3]           <NA>             41              7 3_Middle_lobe_tumor
>>       softFilterMatrix
>>               <matrix>
>>   [1]
>>   [2]
>>   [3]
>
> Trying to read the indel file gives this error:
>
>  >  vr <-
> readVcfAsVRanges("3_Middle_lobe_tumor--2_mucosal_normal.indel.sominddet.v2.3-9.vcf",
> "")
> Error in validObject(.Object) :
>    invalid class “VRanges” object: 'refDepth' must be non-negative
>
>
> Look at the depth variable only with readGeno().
>
>  > fl <- "3_Middle_lobe_tumor-2_mucosal_normal.indel.sominddet.v2.3-9.vcf"
>  > ad <- readGeno(fl, "AD")
>  > dim(ad)
> [1] 547698      2      2
>  > min(ad)
> [1] -23
>
> The negative depth may need some investigating.

Sorry, this isn't clear. What I should have said was readVcf() 
readVcfAsVRanges() operate as expected on this file. readVcfAsVRanges() 
requires AD to be positive if it is present. If the negative AD values 
don't concern you another approach would be to use readVcf() with a 
param that drops all info and geno variables.

param <- ScanVcfParam(info=NA, geno=NA)
vcf <- readVcf(fl, "genome", param=param)

Then coerce to a VRanges.

Valerie

>
> Valerie
>
>>> sessionInfo()
>> R version 3.1.0 Patched (2014-04-18 r65405)
>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>
>> locale:
>>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] parallel  stats     graphics  grDevices utils     datasets  methods
>> [8] base
>>
>> other attached packages:
>> [1] VariantAnnotation_1.10.2 Rsamtools_1.16.0         Biostrings_2.32.0
>> [4] XVector_0.4.0            GenomicRanges_1.16.3     GenomeInfoDb_1.0.2
>> [7] IRanges_1.22.9           BiocGenerics_0.10.0
>>
>> loaded via a namespace (and not attached):
>>  [1] AnnotationDbi_1.26.0    BatchJobs_1.2           BBmisc_1.6
>>  [4] Biobase_2.24.0          BiocParallel_0.6.1      biomaRt_2.20.0
>>  [7] bitops_1.0-6            brew_1.0-6              BSgenome_1.32.0
>> [10] codetools_0.2-8         DBI_0.2-7               digest_0.6.4
>> [13] fail_1.2                foreach_1.4.2
>> GenomicAlignments_1.0.1
>> [16] GenomicFeatures_1.16.2  iterators_1.0.7         plyr_1.8.1
>> [19] Rcpp_0.11.2             RCurl_1.95-4.1          RSQLite_0.11.4
>> [22] rtracklayer_1.24.2      sendmailR_1.1-2         stats4_3.1.0
>> [25] stringr_0.6.2           tools_3.1.0             XML_3.98-1.1
>> [28] zlibbioc_1.10.0
>
>
> On 06/13/2014 04:33 PM, Valerie Obenchain wrote:
>> Hi,
>>
>> I see you've got an old version of the package. The release version is
>> 1.10.1. FYI you can see all release versions here:
>>
>> http://www.bioconductor.org/checkResults/release/bioc-LATEST/
>>
>> Please update R and reinstall packages with biocLite(). If you need any
>> help with that guidelines are here:
>> http://www.bioconductor.org/install/
>>
>> If you want to send me a small test file offline I can confirm that the
>> current version can read it.
>>
>>
>> Valerie
>>
>>
>> On 06/13/2014 03:06 PM, Murli wrote:
>>> Sorry, forgot to post the code contained in the R file in the earlier
>>> post. These are the few lines in vcfToCravat.R that I am sourcing.
>>> library(VariantAnnotation)
>>> DataDir="../Project_NAI_01117_TNWGS/Sample_3_Middle_lobe_tumor/analysis/"
>>>
>>> vcfFile=paste(DataDir,"3_Middle_lobe_tumor--2_mucosal_normal.snv.mutect.v1.1.4.annotated.vcf",sep
>>>
>>>
>>> ="")
>>> vcfFile=paste(ataDir,"3_Middle_lobe_tumor--2_mucosal_normal.indel.sominddet.v2.3-9.vcf",sep="")
>>>
>>>
>>> vr <- readVcfAsVRanges(vcfFile, "hg19")
>>> df <- as.data.frame(vr)
>>>
>>>
>>> On Fri, Jun 13, 2014 at 5:47 PM, Murli <murlinair at gmail.com
>>> <mailto:murlinair at gmail.com>> wrote:
>>>
>>>     Hi Valerie,
>>>     Thanks for the help. I am getting the following errors when I am
>>>     reading the vcf files.
>>>       vr <- readVcfAsVRanges(vcfFile, "hg19")
>>>     Error in lapply(ivar[inms], drop) :
>>>        error in evaluating the argument 'X' in selecting a method for
>>>     function 'lapply': Error in normalizeSingleBracketSubscript(j, x) :
>>>        subscript contains invalid names
>>>
>>>     With another file I am getting the following
>>>      > source("vcfToCravat.R")
>>>     Error in validObject(.Object) :
>>>        invalid class VRanges
>>>
>>>     Cheers../Murli
>>>
>>>
>>>     On Fri, Jun 13, 2014 at 3:29 PM, Valerie Obenchain
>>>     <vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>> wrote:
>>>
>>>         Hi,
>>>
>>>         Use readVcfAsVRanges() then coerce to a data.frame.
>>>
>>>         fl <- system.file("extdata", "chr7-sub.vcf.gz",
>>>         package="VariantAnnotation")
>>>         vr <- readVcfAsVRanges(fl, "hg19")
>>>         df <- as.data.frame(vr)
>>>
>>>         You'll have some extra columns in the data.frame but you can
>>>         remove / rename columns as necessary.
>>>
>>>         Valerie
>>>
>>>
>>>
>>>
>>>
>>>         On 06/13/2014 10:46 AM, Murli [guest] wrote:
>>>
>>>             Hi,
>>>             I am interested in extracting information for functional
>>>             annotation using CRAVAT. It requires the data to be in the
>>>             following format.
>>>             ==============================__=============
>>>             # UID / Chr. / Position / Strand / Ref. base / Alt. base /
>>>             Sample ID (optional)
>>>             TR1     chr17   7577506 -       G       T       TCGA-02-0231
>>>             TR2     chr10   123279680       -       G       A
>>>             TCGA-02-3512
>>>             TR3     chr13   49033967        +       C       A
>>>             TCGA-02-3532
>>>             TR4     chr7    116417505       +       G       T
>>>             TCGA-02-1523
>>>             TR5     chr7    140453136       -       T       A
>>>             TCGA-02-0023
>>>             TR6     chr17   37880998        +       G       T
>>>             TCGA-02-0252
>>>             Ins1 chr17      37880998        +       G       GT
>>>               TCGA-02-0252
>>>             Del1 chr17      37880998        +       GA      G
>>>             TCGA-02-0252
>>>             CSub1 chr2      39871235        +       ATGCT   GA
>>>               TCGA-02-0252
>>>
>>>             ==============================__=================
>>>
>>> http://www.cravat.us/help.jsp?__chapter=how_to_cite&article=#
>>> <http://www.cravat.us/help.jsp?chapter=how_to_cite&article=#>
>>>
>>>             I am trying to extract this information from vcf files
>>>             generated by mutect. I am using VariantAnnotation extract
>>>             this information. I have read the file using readVcf(), and
>>>             renamed the chromosomes according to txdb.
>>>
>>>             rowData(newVcfData)
>>>             GRanges with 62991 ranges and 5 metadata columns:
>>>                                 seqnames                 ranges strand
>>>             | paramRangeID
>>>                                    <Rle>              <IRanges>  <Rle>
>>>             |     <factor>
>>>                    1:109641_A/G     chr1       [109641, 109641]      *
>>>             |         <NA>
>>>                    1:526561_T/G     chr1       [526561, 526561]      *
>>>             |         <NA>
>>>                    1:691958_G/A     chr1       [691958, 691958]      *
>>>             |         <NA>
>>>                    1:763781_A/T     chr1       [763781, 763781]      *
>>>             |         <NA>
>>>                       rs6594026     chr1       [782981, 782981]      *
>>>             |         <NA>
>>>                             ...      ...                    ...    ...
>>>             ...          ...
>>>                        rs480725     chrX [154903224, 154903224]      *
>>>             |         <NA>
>>>                 X:154925893_C/T     chrX [154925893, 154925893]      *
>>>             |         <NA>
>>>                 X:155038107_C/G     chrX [155038107, 155038107]      *
>>>             |         <NA>
>>>                 X:155204257_G/T     chrX [155204257, 155204257]      *
>>>             |         <NA>
>>>                 X:155234730_T/C     chrX [155234730, 155234730]      *
>>>             |         <NA>
>>>                                            REF                ALT
>>>               QUAL      FILTER
>>>                                 <DNAStringSet> <DNAStringSetList>
>>>             <numeric> <character>
>>>                    1:109641_A/G              A                  G
>>>               8.90        PASS
>>>                    1:526561_T/G              T                  G
>>>               9.19        PASS
>>>                    1:691958_G/A              G                  A
>>>             13.74        PASS
>>>                    1:763781_A/T              A                  T
>>>             16.03        PASS
>>>                       rs6594026              C                  T
>>>             11.24        PASS
>>>                             ...            ...                ...
>>>             ...         ...
>>>                        rs480725              A                  T
>>>               6.39        PASS
>>>                 X:154925893_C/T              C                  T
>>>               6.53        PASS
>>>                 X:155038107_C/G              C                  G
>>>               6.64        PASS
>>>                 X:155204257_G/T              G                  T
>>>               6.35        PASS
>>>                 X:155234730_T/C              T                  C
>>>               6.51        PASS
>>>                 ---
>>>                 seqlengths:
>>>                   chr1 chr10 chr11 chr12 chr13 chr14 ...  chr5  chr6
>>>               chr7  chr8  chr9  chrX
>>>                     NA    NA    NA    NA    NA    NA ...    NA    NA
>>>               NA    NA    NA    NA
>>>
>>>
>>>             Can the information be extracted using VariantAnnotation()?
>>>             I would appreciate your help with this.
>>>             Thanks ../Murli
>>>
>>>
>>>
>>>                -- output of sessionInfo():
>>>
>>>                 sessionInfo()
>>>
>>>             R version 3.0.2 (2013-09-25)
>>>             Platform: x86_64-redhat-linux-gnu (64-bit)
>>>
>>>             locale:
>>>                [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>>                [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>>                [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>>                [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>>                [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>>             [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>
>>>             attached base packages:
>>>             [1] parallel  stats     graphics  grDevices utils
>>>             datasets  methods
>>>             [8] base
>>>
>>>             other attached packages:
>>>                [1] TxDb.Hsapiens.UCSC.hg19.__knownGene_2.10.1
>>>                [2] GenomicFeatures_1.14.5
>>>                [3] AnnotationDbi_1.24.0
>>>                [4] Biobase_2.22.0
>>>                [5] VariantAnnotation_1.8.13
>>>                [6] Rsamtools_1.14.3
>>>                [7] Biostrings_2.30.1
>>>                [8] GenomicRanges_1.14.4
>>>                [9] XVector_0.2.0
>>>             [10] IRanges_1.20.7
>>>             [11] BiocGenerics_0.8.0
>>>
>>>             loaded via a namespace (and not attached):
>>>                [1] biomaRt_2.18.0     bitops_1.0-6       BSgenome_1.30.0
>>>                 DBI_0.2-7
>>>                [5] RCurl_1.95-4.1     RSQLite_0.11.4
>>>             rtracklayer_1.22.7 stats4_3.0.2
>>>                [9] tools_3.0.2        XML_3.98-1.1       zlibbioc_1.8.0
>>>
>>>
>>>             --
>>>             Sent via the guest posting facility at bioconductor.org
>>>             <http://bioconductor.org>.
>>>
>>>
>>>
>>>         --
>>>         Valerie Obenchain
>>>         Program in Computational Biology
>>>         Fred Hutchinson Cancer Research Center
>>>         1100 Fairview Ave. N, Seattle, WA 98109
>>>
>>>         Email: vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>
>>>         Phone: (206) 667-3158 <tel:%28206%29%20667-3158>
>>>
>>>
>>>
>>
>>
>
>