[BioC] VariantAnnotation: Specifying 'seqinfo' at import with 'readVcf'
Julian Gehring
julian.gehring at embl.de
Sun Nov 3 21:41:13 CET 2013
Hi Michael,
As a minimal example, I'll try to import the non-coding COSMIC VCF
(ftp://ngs.sanger.ac.uk/production/cosmic/CosmicNonCodingVariants_v67_20131024.vcf.gz)
with a NCBI 37 seqinfo object.
#+BEGIN_SRC R
> seq_info
Seqinfo of length 25
seqnames seqlengths isCircular genome
1 249250621 FALSE GRCh37
2 243199373 FALSE GRCh37
3 198022430 FALSE GRCh37
4 191154276 FALSE GRCh37
5 180915260 FALSE GRCh37
... ... ... ...
21 48129895 FALSE GRCh37
22 51304566 FALSE GRCh37
X 155270560 FALSE GRCh37
Y 59373566 FALSE GRCh37
MT 16569 TRUE GRCh37
> x = readVcf("CosmicNonCodingVariants_v67_20131024.vcf.gz", seq_info)
Warning in .bcfHeaderAsSimpleList(header) :
duplicate keys in header will be forced to unique rownames
Error in makeNewSeqnames(x, new2old = new2old, seqlevels(value)) :
when 'new2old' is NULL, the first elements in the
supplied 'seqlevels' must be identical to 'seqlevels(x)'
> traceback()
8: stop("when 'new2old' is NULL, the first elements in the\n", "
supplied 'seqlevels' must be identical to 'seqlevels(x)'")
7: makeNewSeqnames(x, new2old = new2old, seqlevels(value))
6: `seqinfo<-`(`*tmp*`, value = <S4 object of class "Seqinfo">)
5: `seqinfo<-`(`*tmp*`, value = <S4 object of class "Seqinfo">)
4: .scanVcfToVCF(scanVcf(file, param = param), file, genome, param)
3: .readVcf(file, genome, param = ScanVcfParam())
2: readVcf("CosmicNonCodingVariants_v67_20131024.vcf.gz", seq_info)
1: readVcf("CosmicNonCodingVariants_v67_20131024.vcf.gz", seq_info)
#+END_SRC
Please let me know if you need any more information.
Best wishes
Julian
On 11/03/2013 07:12 PM, Michael Lawrence wrote:
> This was my doing. I guess merge,Seqinfo,Seqinfo does not support
> different orderings? Would help to see the actual error message and
> stack trace. Any advice Herve?
>
> Michael
>
>
> On Sun, Nov 3, 2013 at 6:55 AM, Julian Gehring <julian.gehring at embl.de
> <mailto:julian.gehring at embl.de>> wrote:
>
> Hi Valerie and Herve,
>
> When providing a 'Seqinfo' object as the 'genome' argument to
> 'readVcf', would it be possible to not have the requirement of the
> VCF header and the seqinfo object to have the same ordering? As the
> moment (VariantAnnotation_1.8.2), the import will fail if this is
> not the case. However, this requires the user to know the ordering
> of the seqnames in the VCF file beforehand. In the ideal case, the
> readVcf function could reorder the seqnames in the way as provided
> by the 'genome' argument.
>
>
> Best wishes
> Julian
>
>
> On 09/28/2013 12:57 AM, Hervé Pagès wrote:
>
> Hi Julian,
>
> With the latest devel version of GenomicRanges (1.13.48, should
> become
> available via biocLite() in the next 24 hours or so), setting the
> seqinfo of a big VCF object should be much faster but your mileage
> may vary. Let us know if you still find it slow enough to justify a
> mechanism for letting the user specify the seqinfo upfront when
> using
> readVcf() (e.g. the 'genome' arg could be renamed 'seqinfo' and
> accept
> everything it supports now plus a Seqinfo object).
>
> Cheers,
> H.
>
> On 09/24/2013 10:00 AM, Julian Gehring wrote:
>
> Hi Valerie,
>
> Thanks for clarifying. No, there is not currently a way
> to do this.
>
> The 'seqinfo' on the rowData(vcf) should not be
> difficult to change. Can
> you provide more detail as to (1) why you are changing
> it (did readVcf()
> import something incorrectly or ?) and (2) what
> operations on the
> 'seqinfo' are taking a long time.
>
>
> 'readVcf' did everything in a correct way. I needed to add the
> information about the genome, circularity, and seqlengths
> based on
> external annotitation, since it is not part of the VCF file.
>
> I can't tell you at the moment where 'seqinfo <-' spends
> most of its
> time. I'll profile it and get back to you.
>
> Thank your for your quick answer and your help!
>
> Best wishes
> Julian
>
>
> Thanks.
> Valerie
>
>
> Best wishes
> Julian
>
>
> On 09/24/2013 06:31 PM, Valerie Obenchain wrote:
>
> Hi Julian,
>
> On 09/24/2013 02:29 AM, Julian Gehring wrote:
>
> Hi,
>
> Is there a direct way to specifiy the
> 'seqinfo' of a genome for the
> import of a VCF file using 'readVcf'?
>
>
> I think the question is how to read in a subset of
> chromosomes/positions
> from a vcf file without an accompanying tabix
> index. You can't.
> readVcf() requires an index when subsets are
> defined by
> chromosome/position. However you can read in
> subsets defined by INFO
> and/or GENO fields without an index.
>
> Approaches:
> (1) create index with ?indexTabix and specify
> 'which' in ScanVcfParam
> (2) use ?filterVcf to write out a new file of
> records of interest
>
> I'm aware that one can change it
> with the 'seqinfo' method afterwards, but
> for large VCF files this
> can
> take a significant amount of time.
>
>
> What operation is taking along time? Subsetting
> the VCF object by
> chromosome?
>
> Valerie
>
>
> An alternative would be to sneak it in by
> the 'which' arguments, such
> as:
>
> readVcf(file, genome, ScanVcfParam(which =
> as(seq_info, "GRanges")))
>
> but this requires the file to be indexed
> beforehand.
>
> Best wishes
> Julian
>
>
>
>
> _________________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
> https://stat.ethz.ch/mailman/__listinfo/bioconductor
> <https://stat.ethz.ch/mailman/listinfo/bioconductor>
> Search the archives:
> http://news.gmane.org/gmane.__science.biology.informatics.__conductor
> <http://news.gmane.org/gmane.science.biology.informatics.conductor>
>
>
>
> _________________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
> https://stat.ethz.ch/mailman/__listinfo/bioconductor
> <https://stat.ethz.ch/mailman/listinfo/bioconductor>
> Search the archives:
> http://news.gmane.org/gmane.__science.biology.informatics.__conductor <http://news.gmane.org/gmane.science.biology.informatics.conductor>
>
>
More information about the Bioconductor
mailing list