[BioC] Fisher-Yates algorithm for DNA shuffling ?
Hervé Pagès
hpages at fhcrc.org
Wed Jun 24 01:05:34 CEST 2009
Hi Robert,
Robert Castelo wrote:
> dear list,
>
> i'm interested in using some proper DNA shuffling procedure like the
> Fisher-Yates algorithm:
>
> R. A. Fisher and F. Yates, Example 12, Statistical Tables, London, 1938.
>
> Richard Durstenfeld, Algorithm 235: Random permutation, CACM 7(7):420,
> July 1964.
The Fisher & Yates or Durstenfeld algos are just particular implementations
of a ramdom permutation generator.
R has the sample() function to generate a random permutation. The man page
doesn't specify what implementation is used but does that really matter?
>
> i've been googling bioconductor and the r-project for this, and
> particularly looking at the Biostrings package, but I've failed to find
> anything. just in case i'm missing some implementation of this DNA
> shuffling in R, i'd like to ask the list whether anybody knows of that
> procedure being implemented in R or, even better, in a bioconductor
> package such that would work with DNAString objects.
>
> the reason for that is that i would like to simulate random DNA strings
> preserving nucleotide composition in order to calculate empirical
> p-values for motif scanning.
Generate a random DNA string (of length 1000) with specific
nucleotide probabilities:
x <- paste(sample(c("A", "C", "G", "T"), 1000, replace=TRUE,
prob=c(0.2, 0.55, 0.1, 0.15)), collapse="")
x <- DNAString(x)
Then:
> alphabetFrequency(x, baseOnly=TRUE)
A C G T other
214 531 110 145 0
If you really want to shuffle a given DNAString object x0 in order to
preserve its nucleotide composition:
x1 <- x0[sample(length(x0))]
For example, shuffling the 'x' obtained above gives:
x1 <- x[sample(length(x))]
> alphabetFrequency(x1, baseOnly=TRUE)
A C G T other
214 531 110 145 0
Performance:
> system.time(for (i in 1:10000) {y <- x[sample(length(x))]})
user system elapsed
5.760 0.008 5.764
Everything is already here and probably fast enough. No need to
re-implement anything.
Cheers,
H.
>
> if it is not yet implemented i would like to suggest the package
> maintainers of Biostrings or any other biological sequence
> infrastructure package in Bioconductor to have it implemented in C for
> having maximum speed with this kind of simulations. just in case it
> eases my request here goes some simple R code implementing the
> Fisher-Yates shuffling algorithm:
>
> fy <- function(seq) {
> seq <- unlist(strsplit(seq, ""))
> n <- length(seq)
> i <- n
> while (i > 0) {
> j <- floor(runif(1)*i+1)
> if (i != j) {
> tmp <- seq[i]
> seq[i] <- seq[j]
> seq[j] <- tmp
> }
> i <- i - 1
> }
> paste(seq, collapse="")
> }
>
> and this is an example of how to use it:
>
> # generate some random DNA sequence of length 3
> seq <- paste(sample(c("A","C","G","T"), size=3, replace=TRUE), collapse="")
> seq
> [1] "ATG"
>
> # shuffle the sequence 10,000 times and store all the permutations
> shuffledseqs <- c()
> for (i in 1:10000) shuffledseqs <- c(shuffledseqs, fy(seq))
>
> # verify that indeed the permutations occur uniformly at random
> base::table(shuffledseqs)/10000
> shuffledseqs
> AGT ATG GAT GTA TAG TGA
> 0.1693 0.1682 0.1678 0.1629 0.1610 0.1708
> 1/6
> [1] 0.1666667
>
>
> thanks!
> robert.
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioconductor
mailing list