[R] how to group a large list of strings into categories based on string similarity?

Thu Jun 24 04:46:30 CEST 2010

On 06/23/2010 06:55 PM, G FANG wrote:
> Hi,
> 
> I want to group a large list (20 million) of strings into categories
> based on string similarity.
> 
> The specific problem is: given a list of DNA sequences as below
> 
> ACTCCCGCCGTTCGCGCGCAGCATGATCCTG
> ACTCCCGCCGTTCGCGCGCNNNNNNNNNNNN
> CAGGATCATGCTGCGCGCGAACGGCGGGAGT
> CAGGATCATGCTGCGCGCGAANNNNNNNNNN
> CAGGATCATGCTGCGCGCGNNNNNNNNNNNN
> ......
> .....
> NNNNNNNCCGTTCGCGCGCAGCATGATCCTG
> NNNNNNNNNNNNCGCGCGCAGCATGATCCTG
> NNNNNNNNNNNNGCGCGCGAACGGCGGGAGT
> NNNNNNNNNNNNNNCGCGCAGCATGATCCTG
> NNNNNNNNNNNTGCGCGCGAACGGCGGGAGT
> NNNNNNNNNNTTCGCGCGCAGCATGATCCTG
> 
> 'N' is the missing letter
> 
> It can be seen that some strings are the same except for those N's
> (i.e. N can match with any base)
> 
> Given this list of strings, I want to have
> 
> 1) a vector corresponding to each row (string): for each string assign
> an id, such that similar strings (those that differ only at N's) have
> the same id
> 2) a mapping list from unique strings ('unique' in terms of
> the similarity defined above) to the ids
> 
> I am a matlab user shifting to R. Please advise on efficient ways to do this.

The Bioconductor Biostrings package has many tools for this sort of
operation. See http://bioconductor.org/packages/release/Software.html

You may need a one-time install

   source('http://bioconductor.org/biocLite.R')
   biocLite('Biostrings')

then

  library(Biostrings)
  x <- c("ACTCCCGCCGTTCGCGCGCAGCATGATCCTG",
        "ACTCCCGCCGTTCGCGCGCNNNNNNNNNNNN",
        "CAGGATCATGCTGCGCGCGAACGGCGGGAGT",
        "CAGGATCATGCTGCGCGCGAANNNNNNNNNN",
        "NCAGGATCATGCTGCGCGCGAANNNNNNNNN",
        "CAGGATCATGCTGCGCGCGNNNNNNNNNNNN",
        "NNNCAGGATCATGCTGCGCGCGAANNNNNNN")
  names(x) <- seq_along(x)    # label each sequence by its row number
  dna <- DNAStringSet(x)
  ## trim one N from each end per pass; stop once no width changes
  while (!all(width(dna) ==
              width(dna <- trimLRPatterns("N", "N", dna)))) {}
  ## index the row labels by the rank of each trimmed sequence
  names(dna)[rank(dna)]

although there might be a faster way (e.g., trimming runs of 8, then 4,
2, and 1 N's rather than a single N per pass). Also, your sequences
likely come from a fasta file (Biostrings::readFASTA), a text file with
a column of sequences (ShortRead::readXStringColumns), or alignment
software (ShortRead::readAligned / ShortRead::readFastq). If you go
this route you'll want to address questions to the Bioconductor
mailing list:

  http://bioconductor.org/docs/mailList.html
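
As a rough illustration of the trimming idea in plain base R (no
Biostrings), here is a minimal sketch. It assumes the sequences are
held in an ordinary character vector and that "similar" means
"identical once leading and trailing N's are stripped", which is the
same definition the loop above works with. The names trimmed, keys, id
and mapping are made up for this example. It will not merge a
full-length read with a truncated copy of it, so treat it as a starting
point rather than a complete answer.

```r
# Example input: the same seven sequences used in the reply above
x <- c("ACTCCCGCCGTTCGCGCGCAGCATGATCCTG",
       "ACTCCCGCCGTTCGCGCGCNNNNNNNNNNNN",
       "CAGGATCATGCTGCGCGCGAACGGCGGGAGT",
       "CAGGATCATGCTGCGCGCGAANNNNNNNNNN",
       "NCAGGATCATGCTGCGCGCGAANNNNNNNNN",
       "CAGGATCATGCTGCGCGCGNNNNNNNNNNNN",
       "NNNCAGGATCATGCTGCGCGCGAANNNNNNN")

# Strip each leading and trailing run of N's in one vectorized pass
trimmed <- gsub("^N+|N+$", "", x)

# One id per row: rows whose trimmed sequences are identical share an id
keys <- unique(trimmed)
id <- match(trimmed, keys)
id
# 1 2 3 4 4 5 4

# Mapping from each unique trimmed sequence to its id
mapping <- setNames(seq_along(keys), keys)
```

Here id answers request 1) and mapping answers request 2) under the
assumption stated above. Whether gsub() and match() on character
vectors scale acceptably to 20 million strings is something to
measure on your own data.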

Martin

> Thanks!
> 
> Gang

-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793