[BioC] Bioc Interface for 1,000 Genomes SNP Frequencies
Valerie Obenchain
vobencha at fhcrc.org
Tue Jul 17 17:36:47 CEST 2012
Hi Tim,
1000 Genomes data can be read in with readVcf(). The 'param' argument
allows you to specify chromosomes and/or genome positions you are
interested in
Using a sample file from 1000 genomes,
library(VariantAnnotation)
fl <-
"ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/supporting/AFR.2of4intersection_allele_freq.20100804.genotypes.vcf.gz"
genomes <- readVcf(TabixFile(fl), "hg19", param=GRanges("12",
IRanges(101266000, 101422000)))
The data are read into a VCF obejct. See ?VCF for class details. If a
snp has an rsid, it will appear as the rowname. If not, the rowname is a
concatenation of chromosome:position.
> genomes
class: VCF
dim: 1325 174
genome: hg19
exptData(1): header
fixed(4): REF ALT QUAL FILTER
info(6): AF DP ... AFR_R2 ASN_R2
geno(7): AD DP ... GD OG
rownames(1325): rs75462666 rs78597949 ... 12:101421722 rs67962959
rowData values names(1): paramRangeID
colnames(174): HG00553 HG00554 ... NA19982 NA20414
colData names(1): Samples
We don't have a function that computes the frequencies but that can be
accomplished fairly easily. The genotype data can be accessed with the
'geno' function.
> geno(genomes)
SimpleList of length 7
names(7): AD DP GL GQ GT GD OG
> geno(genomes)$GT[1:5,1:3]
HG00553 HG00554 HG00637
rs75462666 "0|0" "0|0" "0|0"
rs78597949 "0|0" "0|0" "0|0"
rs78693793 "0|0" "0|0" "0|0"
12:101266725 "0|0" "0|0" "0|0"
12:101266729 "0|0" "0|0" "0|0"
The frequency of the alternate allele is computed as [ 0|1 + 2*(1|1)] /
2*(total number of individuals). Maybe something like,
heterozyg <- apply(geno(genomes)$GT, 2, function(x) x %in% c("0|1",
"1|0"))
homozyg <- apply(geno(genomes)$GT, 2, function(x) x %in% "1|1")
nsub <- ncol(geno(genomes)$GT)
freq <- (heterozyg + 2*homozyg) / 2*nsub
Valerie
On 07/17/2012 08:09 AM, Vincent Carey wrote:
> Val Obenchain can reply more definitively, but, briefly, VariantAnnotation
> package has a scanVcf capability that
> allows survey of vcf archives with managed memory consumption. If you do
> not see enough details in the vignette
> to solve this use case efficiently, let us know.
>
> On Tue, Jul 17, 2012 at 10:53 AM, Timothy Duff<timothy.duff at ncf.edu> wrote:
>
>> Hi. I have a set of rsIDs for which I'm interested in obtaining allele
>> frequencies for (in a particular population surveyed by 1kGP.) The
>> corresponding .vcf files are pretty memory consumptive. Do there currently
>> exist basic BC tools that could help with this sort of thing? Thank you all
>> for your consideration.
>>
>> --
>> Tim
>>
>> [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
More information about the Bioconductor
mailing list