[BioC] Count differences between sequences

Sun Mar 28 22:17:21 CEST 2010

Hi Erik et al.,

Not offering anything practical, but...

On 03/28/2010 11:39 AM, erikwright at comcast.net wrote:
> Hi Patrick, Michael, HervÃ©, and Martin,
> 
> 
> Wow! Thanks very much everyone for putting so much effort into
> answering my question. While we wait for Patrick's update of
> stringDist to become available, I will describe my solution and a
> related problem I have run into.
> 
> 
> I am using this piece of code to do most of the work:
> 
> 
> 
> 
> 
> 
> numF <- length ( myDNAStringSet )
> 
> for ( i in 1 :( numF - 1 )) {
> 
> # use IUPAC_CODE_MAP with fixed=FALSE
> 
> distMatrix [ i ,( i + 1 ): numF ] <- neditStartingAt (
> 
> myDNAStringSet [[ i ]],
> 
> myDNAStringSet [( i + 1 ): numF ],
> 
> fixed = FALSE )
> 
> distMatrix [( i + 1 ): numF , i ] <- distMatrix [ i ,( i + 1 ): numF
> ]
> 
> 
> }
> 
> diag ( distMatrix ) <- 1
> 
> 
> This will populate a matrix with the number of differences between
> strings. The code works reasonably fast - about 15 seconds for a
> DNAStringSet with 500 sequences of length 8,000. A DNAStringSet with
> 8,000 sequences of length 8,000 is exponentially slower - about 45
> minutes. Obviously, if Patrick's update of stringDist improves the
> speed it will greatly help.

Not exponential, the algorithm is polynomial in the number of
comparisons N (N - 1) / 2. I think Patrick's update will help you
through this part of your problem though -- it's still polynomial, but
fast enough that you hit memory limits before unbearable time limits. But...

Here's a little R code that, ignoring the commented out bit for a
moment, groups a single 'column' of nucleotides in linear time -- some
small multiple of N

f0 <- function(dna)
{
    l <- unique(width(dna))
    w <- length(dna)

    dna0 <- factor(unlist(strsplit(as.character(dna), "")))
    idx <- seq(1, l * w, l)  # 'column' offsets
    i <- seq_len(w)

    d <- matrix(l, w, w)
    for (j in seq_len(l)-1L) {
        grp <- split(i, dna0[j+idx]) # a 'column' of nucleotides
##         for (k in grp)
##             d[k, k] = d[k, k] - 1L
    }
    d
}

Internally, 'split' does its work in 2N comparisons. This important part
is that comparisons are accomplished in linear rather than polynomial time.

Each element of 'grp' is an integer vector indexing individuals with the
same nucleotide. The code above assumes one is interested in Hamming
distance. It initializes the distance matrix 'd' so that everyone is
maximally distant from one another. and then in the commented out code
updates the distance by subtracting one from all individuals with the
same nucleotide (i.e., in k). This is unfortunately the slow part of the
revised code, and it is polynomial time -- in the best case, for 4
equally frequent nucleotides, we do 4 * ( (N/4)^2 ) = N^2 / 4 updates,
and in the worst case (all nucleotides identical) we do N^2 updates. We
pay a heavy price for doing this at the R level, and basically get
swamped by the update rather than sort cost.

But we shouldn't lose track of the fact that we can do the sorting much
more efficiently. I've also looked a little at sort.list(,
method="radix"), which orders a factor very quickly.

I wonder what the down-stream fate of this distance matrix is, because
perhaps it doesn't need to be create it at all?

> 
> 
> In playing around with this I have run into a related problem that
> you all might be able to help me with. As I stated in my original
> email, I am trying to calculate a Distance Matrix with a large
> aligned DNAStringSet. My DNAStringSet does not contain ideal data,
> meaning there are many gap characters ("-"). If I find the edit
> difference between two strings that have gaps in the same places then
> the gaps are not counted towards the edit distance. This makes
> complete sense because "-" == "-". My problem is that I want to
> include any gap characters in my edit distance. For example:
> 
> 
> 
>> x <- DNAString("--A") y <- DNAString("--A") neditStartingAt(x,y)
> [1] 0
> 
> 
> The distance between these two strings for my distance measurement
> should be 2. That is because the gap character ("-") gives me no
> data, so I cannot say that the distance between the two strings is 0.
> Can anyone think of a good method of counting common gap characters
> toward the edit distance? Unfortunately I cannot simply count up the
> number of gap characters in my pattern and add it to my edit
> distance. For example:
> 
> 
> 
> 
>> x <- DNAString("AC--") y <- DNAString("ACC-") neditStartingAt(x,y)
>> 
> [1] 1
> 
> 
> The distance between these two strings for my distance measurement
> should be 2. If I were to add the number of gap characters in the
> pattern to the edit distance I would get 2 + 1 = 3. So basically I
> want to count any gap character in the pattern or subject toward the
> edit distance because it gives me no information. Does this make
> sense, and does anyone have an idea for how to do this?

Again not being helpful in a practical sense here, but the idea is that
you have a distance matrix between letters in the
Biostrings::IUPAC_CODE_MAP. The implementation Patrick provides is I
think a matrix of 0's and 1's, whereas you'd like support for a
generalized distance matrix. If that matrix is called 'dist', then I
think the following (untested; it needs to not update if
length(grp[[l]]) or length(grp[[m]]) is zero) does the trick...

f0d <- function(dna, dist)
{
    l <- unique(width(dna))
    w <- length(dna)

    dna0 <- factor(unlist(strsplit(as.character(dna), "")))
    idx <- seq(1, l * w, l)
    i <- seq_len(w)

    d <- matrix(0, w, w)
    for (j in seq_len(l)-1L) {
        grp <- split(i, dna0[j+idx])
        for (l in seq_along(grp))
            for (m in seq_along(grp))
                d[grp[[l]], grp[[m]]] <-
                    d[grp[[l]], grp[[m]]] + dist[l, m]
    }
    d
}

Martin

> 
> 
> Thanks again, Erik
> 
> 
> 
> ----- Original Message ----- From: "Patrick Aboyoun"
> <paboyoun at fhcrc.org> To: "Michael Lawrence"
> <lawrence.michael at gene.com> Cc: "BioC list"
> <bioconductor at stat.math.ethz.ch> Sent: Saturday, March 27, 2010
> 7:39:16 PM GMT -06:00 US/Canada Central Subject: Re: [BioC] Count
> differences between sequences
> 
> I just added a Hamming distance metric to the stringDist function in
> BioC 2.6, which should be available on bioconductor.org in 36 hours.
> This uses the code underlying the neditStartingAt functions and loops
> over the lower triangle of the distance matrix in C so it is faster
> than the sapply method Herve provided. To calculate this newly added
> distance measure, specify stringDist(x, method = "hamming") where x
> is an XStringSet object containing equal length strings.
> library(Biostrings) library(drosophila2probe) ch0 <-
> drosophila2probe[seq_len(500*8000/25), "sequence"] ch <-
> unlist(strsplit(ch0, "")) m0 <- matrix(ch, 500) dna <-
> DNAStringSet(apply(m0, 1, paste, collapse="")) system.time(myDists <-
> stringDist(dna, method = "hamming")) # user system elapsed # 5.459
> 0.029 5.541 sessionInfo() R version 2.11.0 Under development
> (unstable) (2010-03-22 r51355) i386-apple-darwin9.8.0 locale: [1]
> en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8 attached base
> packages: [1] stats graphics grDevices utils datasets methods base
> other attached packages: [1] drosophila2probe_2.5.0
> AnnotationDbi_1.9.6 Biobase_2.7.5 [4] Biostrings_2.15.26
> IRanges_1.5.72 loaded via a namespace (and not attached): [1]
> DBI_0.2-5 RSQLite_0.8-4 tools_2.11.0 Patrick On 3/26/10 1:35 PM,
> Michael Lawrence wrote: > 2010/3/26 Hervï¿½ Pagï¿½s > > >> Hi Erik,
> Patrick, >> >> >> Patrick Aboyoun wrote: >> >> >>> Erik, >>> Judging
> from your data, I would gather that you are not interested in >>>
> indels. Is that correct? You should look at the neditStartingAt
> function. >>> Something like the following may meet your needs: >>>
> >>> N<- length(myStrings) >>> myDists<- matrix(0, nrow = N, ncol = N)
> >>> for (i in 1:(N-1)) >>> for (j in (i+1):N) >>> myDists[i, j]<-
> myDists[j, i]<- neditStartingAt(myStrings[[i]], >>> myStrings[[j]])
> >>> >>> >>> >> Note that you can take advantage of the fact that
> neditStartingAt() is >> vectorized with respect to its second
> argument: >> >> N<- length(myStrings) >> myDists<- sapply(1:N, >>
> function(i) neditStartingAt(myStrings[[i]], myStrings))) >> >> That
> will make things hundred times faster with a big "DNA rectangle" >>
> like yours (500x8000): >> >> >>> system.time( >>> >> myDists<-
> sapply(1:N, >> function(i) neditStartingAt(myStrings[[i]],
> myStrings))) >> user system elapsed >> 12.053 0.000 12.723 >> >> >>>
> myDists[1:10, 1:10] >>> >> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
> [,9] [,10] >> [1,] 0 5888 6055 5947 6152 6248 6038 6175 6268 6047 >>
> [2,] 5888 0 5849 6113 5926 6148 6117 5956 6167 6204 >> [3,] 6055 5849
> 0 6053 6184 5959 6137 6077 5997 6041 >> [4,] 5947 6113 6053 0 6111
> 6167 5910 6096 6121 5959 >> [5,] 6152 5926 6184 6111 0 6038 6209 5906
> 6019 6194 >> [6,] 6248 6148 5959 6167 6038 0 6085 6112 5924 6137 >>
> [7,] 6038 6117 6137 5910 6209 6085 0 5961 6192 5947 >> [8,] 6175 5956
> 6077 6096 5906 6112 5961 0 5899 6183 >> [9,] 6268 6167 5997 6121 6019
> 5924 6192 5899 0 5984 >> [10,] 6047 6204 6041 5959 6194 6137 5947
> 6183 5984 0 >> >> >> > Wow, nice. It sounds to me like this would be
> a common use case; how hard > would it be to vectorize both
> arguments? > > Michael > > > >> Cheers, >> H. >> >> >> >> >>> Patrick
> >>> >>> >>> On 3/25/10 2:57 PM, erikwright at comcast.net wrote: >>> >>>
> >>>> I have 500 DNAStrings, all of length 8000. I need the entire N x
> N >>>> distance matrix. >>>> >>>> Thanks, >>>> Erik >>>> >>>> >>>>
> >>>> ----- Original Message ----- >>>> From: "Patrick Aboyoun" >>>>
> To: erikwright at comcast.net >>>> Cc: bioconductor at stat.math.ethz.ch
> >>>> Sent: Thursday, March 25, 2010 4:45:29 PM GMT -06:00 US/Canada
> Central >>>> Subject: Re: [BioC] Count differences between sequences
> >>>> >>>> Erik, >>>> Could you provide more details on your data? How
> long are each of the >>>> strings and how many strings do you have?
> Also, do you need the entire N x N >>>> distance matrix for
> downstream analysis or are you just looking for closest >>>>
> relatives? >>>> >>>> >>>> Patrick >>>> >>>> >>>> >>>> On 3/25/10 2:29
> PM, erikwright at comcast.net wrote: >>>> >>>> >>>>> Hello all, >>>>>
> >>>>> >>>>> I have a large DNAStringSet and I am trying to calculate
> its >>>>> >>>>> >>>> distance matrix. My DNAStrings are equal width
> and they are already >>>> aligned. >>>> >>>> >>>>> I have tried using
> the stringDist() function, but it is very slow >>>>> >>>>> >>>> for
> large DNAStringSets. Is there a way to quickly calculate the number
> >>>> of differences between two DNAString instances? >>>> >>>> >>>>>
> For example, let's say I have two DNAStrings: "ACAC" and "ACAG". I
> >>>>> >>>>> >>>> would like to know if their is a function other than
> stringDist() that >>>> will tell me the distance between them is 1.
> >>>> >>>> >>>>> Thanks in advance for any help. >>>>> >>>>> >>>>> -
> Erik >>>>> [[alternative HTML version deleted]] >>>>> >>>>>
> _______________________________________________ >>>>> Bioconductor
> mailing list >>>>> Bioconductor at stat.math.ethz.ch >>>>>
> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>> Search the
> archives: >>>>> >>>>> >>>>
> http://news.gmane.org/gmane.science.biology.informatics.conductor
> >>>> >>>> >>> >>> [[alternative HTML version deleted]] >>> >>>
> _______________________________________________ >>> Bioconductor
> mailing list >>> Bioconductor at stat.math.ethz.ch >>>
> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the
> archives: >>>
> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>
> >>> >> -- >> Hervï¿½ Pagï¿½s >> >> Program in Computational Biology
> >> Division of Public Health Sciences >> >> Fred Hutchinson Cancer
> Research Center >> 1100 Fairview Ave. N, M2-B876 >> P.O. Box 19024 >>
> Seattle, WA 98109-1024 >> >> E-mail: hpages at fhcrc.org >> Phone: (206)
> 667-5791 >> Fax: (206) 667-1319 >> >> >>
> _______________________________________________ >> Bioconductor
> mailing list >> Bioconductor at stat.math.ethz.ch >>
> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the
> archives: >>
> http://news.gmane.org/gmane.science.biology.informatics.conductor >>
> >> > [[alternative HTML version deleted]] > > > > >
> _______________________________________________ > Bioconductor
> mailing list > Bioconductor at stat.math.ethz.ch >
> https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the
> archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
> [[alternative HTML version deleted]] 
> _______________________________________________ Bioconductor mailing
> list Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor Search the
> archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor 
> [[alternative HTML version deleted]]
> 
> 
> 
> 
> _______________________________________________ Bioconductor mailing
> list Bioconductor at stat.math.ethz.ch 
> https://stat.ethz.ch/mailman/listinfo/bioconductor Search the
> archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor

-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793