[BioC] Fisher-Yates algorithm for DNA shuffling ?

Thu Jun 25 17:03:18 CEST 2009

hi Hervé,

On Tue, 2009-06-23 at 16:05 -0700, Hervé Pagès wrote:
> Hi Robert,
> 
> Robert Castelo wrote:
> > dear list,
> > 
> > i'm interested in using some proper DNA shuffling procedure like the
> > Fisher-Yates algorithm:
> > 
> > R. A. Fisher and F. Yates, Example 12, Statistical Tables, London, 1938.
> > 
> > Richard Durstenfeld, Algorithm 235: Random permutation, CACM 7(7):420,
> > July 1964.
> 
> The Fisher & Yates or Durstenfeld algos are just particular implementations
> of a ramdom permutation generator.
> R has the sample() function to generate a random permutation. The man page
> doesn't specify what implementation is used but does that really matter?

ups, you're right, i was so focused in finding something specific for
DNA in Biostrings that i forgot to try to use the sample() function, how
embarrassing.. X-p

thanks!
robert.

> > 
> > i've been googling bioconductor and the r-project for this, and
> > particularly looking at the Biostrings package, but I've failed to find
> > anything. just in case i'm missing some implementation of this DNA
> > shuffling in R, i'd like to ask the list whether anybody knows of that
> > procedure being implemented in R or, even better, in a bioconductor
> > package such that would work with DNAString objects.
> > 
> > the reason for that is that i would like to simulate random DNA strings
> > preserving nucleotide composition in order to calculate empirical
> > p-values for motif scanning.
> 
> Generate a random DNA string (of length 1000) with specific
> nucleotide probabilities:
> 
>    x <- paste(sample(c("A", "C", "G", "T"), 1000, replace=TRUE,
>               prob=c(0.2, 0.55, 0.1, 0.15)), collapse="")
>    x <- DNAString(x)
> 
> Then:
> 
>    > alphabetFrequency(x, baseOnly=TRUE)
>        A     C     G     T other
>      214   531   110   145     0
> 
> If you really want to shuffle a given DNAString object x0 in order to
> preserve its nucleotide composition:
> 
>    x1 <- x0[sample(length(x0))]
> 
> For example, shuffling the 'x' obtained above gives:
> 
>    x1 <- x[sample(length(x))]
> 
>    > alphabetFrequency(x1, baseOnly=TRUE)
>        A     C     G     T other
>      214   531   110   145     0
> 
> Performance:
> 
>    > system.time(for (i in 1:10000) {y <- x[sample(length(x))]})
>       user  system elapsed
>      5.760   0.008   5.764
> 
> Everything is already here and probably fast enough. No need to
> re-implement anything.
> 
> Cheers,
> H.
> 
> 
> > 
> > if it is not yet implemented i would like to suggest the package
> > maintainers of Biostrings or any other biological sequence
> > infrastructure package in Bioconductor to have it implemented in C for
> > having maximum speed with this kind of simulations. just in case it
> > eases my request here goes some simple R code implementing the
> > Fisher-Yates shuffling algorithm:
> > 
> > fy <- function(seq) {
> >   seq <- unlist(strsplit(seq, ""))
> >   n <- length(seq)
> >   i <- n
> >   while (i > 0) {
> >     j <- floor(runif(1)*i+1)
> >     if (i != j) {
> >       tmp <- seq[i]
> >       seq[i] <- seq[j]
> >       seq[j] <- tmp
> >     }
> >     i <- i - 1
> >   }
> >   paste(seq, collapse="")
> > }
> > 
> > and this is an example of how to use it:
> > 
> > # generate some random DNA sequence of length 3
> > seq <- paste(sample(c("A","C","G","T"), size=3, replace=TRUE), collapse="")
> > seq
> > [1] "ATG"
> > 
> > # shuffle the sequence 10,000 times and store all the permutations
> > shuffledseqs <- c()
> > for (i in 1:10000) shuffledseqs <- c(shuffledseqs, fy(seq))
> > 
> > # verify that indeed the permutations occur uniformly at random
> > base::table(shuffledseqs)/10000
> > shuffledseqs
> >    AGT    ATG    GAT    GTA    TAG    TGA 
> > 0.1693 0.1682 0.1678 0.1629 0.1610 0.1708 
> >  1/6
> > [1] 0.1666667
> > 
> > 
> > thanks!
> > robert.
> > 
> > _______________________________________________
> > Bioconductor mailing list
> > Bioconductor at stat.math.ethz.ch
> > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>