[BioC] Best practices to find intersection among variants
Julian Gehring
julian.gehring at embl.de
Tue Aug 26 18:58:27 CEST 2014
Hi Marco,
Getting a bit away from Bioconductor: You could also genotype all
samples together with the GATK UGT which will give you the information
about variant alleles over in each samples (as described in the
documentation and best practices). This is helpful for eliminating
false positive/negative calls.
Best
Julian
On 26.08.2014 07:19, Blanchette, Marco wrote:
> All right! Thank you guys.
>
> Here is my little pipeline
> 1) slurping in the vcf files in a directory
> 2) Filtering out the one that didn’t passed the GATK filtering step (This
> could be probably be done during reading from file, couldn’t figure out
> how yet…)
> 3) finding the common one among all the files
> 4) writing back the file to disk
>
> Feel free to comment if you think there would be better ways to do that
> (It’s already better than pairwise comparison of file using java command
> line :-})
>
>
> library(VariantAnnotation)
>
> vcfDir <- "./filtSNPs"
>
> vcfFiles <- list.files(vcfDir,"\\.vcf$",full=TRUE)
>
> vcfs <- lapply(vcfFiles,readVcf,"spombe")
>
> filtered.vcfs <- lapply(vcfs,function(vcf) vcf[filt(vcf) == 'PASS'])
>
> intersectVCF <- function(v1,v2){
> m <- match(as(v1,'VRanges'),as(v2,'VRanges'))
> v2[m[!is.na(m)]]
> }
>
> commonVCFs <- Reduce(intersectVCF,filtered.vcfs)
>
> writeVcf(commonVCFs,file.path(vcfDir,"commonSNP.vcf"),index=TRUE)
>
>
> Thanks
>
More information about the Bioconductor
mailing list