[BioC] Fisher-Yates algorithm for DNA shuffling ?
Robert Castelo
robert.castelo at upf.edu
Thu Jun 25 17:03:18 CEST 2009
hi Hervé,
On Tue, 2009-06-23 at 16:05 -0700, Hervé Pagès wrote:
> Hi Robert,
>
> Robert Castelo wrote:
> > dear list,
> >
> > i'm interested in using some proper DNA shuffling procedure like the
> > Fisher-Yates algorithm:
> >
> > R. A. Fisher and F. Yates, Example 12, Statistical Tables, London, 1938.
> >
> > Richard Durstenfeld, Algorithm 235: Random permutation, CACM 7(7):420,
> > July 1964.
>
> The Fisher & Yates or Durstenfeld algos are just particular implementations
> of a ramdom permutation generator.
> R has the sample() function to generate a random permutation. The man page
> doesn't specify what implementation is used but does that really matter?
ups, you're right, i was so focused in finding something specific for
DNA in Biostrings that i forgot to try to use the sample() function, how
embarrassing.. X-p
thanks!
robert.
> >
> > i've been googling bioconductor and the r-project for this, and
> > particularly looking at the Biostrings package, but I've failed to find
> > anything. just in case i'm missing some implementation of this DNA
> > shuffling in R, i'd like to ask the list whether anybody knows of that
> > procedure being implemented in R or, even better, in a bioconductor
> > package such that would work with DNAString objects.
> >
> > the reason for that is that i would like to simulate random DNA strings
> > preserving nucleotide composition in order to calculate empirical
> > p-values for motif scanning.
>
> Generate a random DNA string (of length 1000) with specific
> nucleotide probabilities:
>
> x <- paste(sample(c("A", "C", "G", "T"), 1000, replace=TRUE,
> prob=c(0.2, 0.55, 0.1, 0.15)), collapse="")
> x <- DNAString(x)
>
> Then:
>
> > alphabetFrequency(x, baseOnly=TRUE)
> A C G T other
> 214 531 110 145 0
>
> If you really want to shuffle a given DNAString object x0 in order to
> preserve its nucleotide composition:
>
> x1 <- x0[sample(length(x0))]
>
> For example, shuffling the 'x' obtained above gives:
>
> x1 <- x[sample(length(x))]
>
> > alphabetFrequency(x1, baseOnly=TRUE)
> A C G T other
> 214 531 110 145 0
>
> Performance:
>
> > system.time(for (i in 1:10000) {y <- x[sample(length(x))]})
> user system elapsed
> 5.760 0.008 5.764
>
> Everything is already here and probably fast enough. No need to
> re-implement anything.
>
> Cheers,
> H.
>
>
> >
> > if it is not yet implemented i would like to suggest the package
> > maintainers of Biostrings or any other biological sequence
> > infrastructure package in Bioconductor to have it implemented in C for
> > having maximum speed with this kind of simulations. just in case it
> > eases my request here goes some simple R code implementing the
> > Fisher-Yates shuffling algorithm:
> >
> > fy <- function(seq) {
> > seq <- unlist(strsplit(seq, ""))
> > n <- length(seq)
> > i <- n
> > while (i > 0) {
> > j <- floor(runif(1)*i+1)
> > if (i != j) {
> > tmp <- seq[i]
> > seq[i] <- seq[j]
> > seq[j] <- tmp
> > }
> > i <- i - 1
> > }
> > paste(seq, collapse="")
> > }
> >
> > and this is an example of how to use it:
> >
> > # generate some random DNA sequence of length 3
> > seq <- paste(sample(c("A","C","G","T"), size=3, replace=TRUE), collapse="")
> > seq
> > [1] "ATG"
> >
> > # shuffle the sequence 10,000 times and store all the permutations
> > shuffledseqs <- c()
> > for (i in 1:10000) shuffledseqs <- c(shuffledseqs, fy(seq))
> >
> > # verify that indeed the permutations occur uniformly at random
> > base::table(shuffledseqs)/10000
> > shuffledseqs
> > AGT ATG GAT GTA TAG TGA
> > 0.1693 0.1682 0.1678 0.1629 0.1610 0.1708
> > 1/6
> > [1] 0.1666667
> >
> >
> > thanks!
> > robert.
> >
> > _______________________________________________
> > Bioconductor mailing list
> > Bioconductor at stat.math.ethz.ch
> > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
More information about the Bioconductor
mailing list