[BioC] Group millions of the same DNA sequences?
    Xiaohui Wu 
    wux3 at muohio.edu
       
    Tue Nov 16 11:46:13 CET 2010
    
    
  
Hi all,
I have about 100 million (100M) DNA reads, each ~150 nt long, and some of them are duplicates. Is there a way to collapse identical sequences into one and count how many times each occurs? I am looking for something like the unique() function in R, but one that also reports the number of occurrences of each read and is more efficient.
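To make the grouping-and-counting request concrete, here is a minimal sketch on toy data. It is written in Python only for illustration; the function name `count_reads` and the example reads are hypothetical, and holding everything in one in-memory hash table is an assumption that may not scale to 100M reads of ~150 nt.

```python
from collections import Counter

def count_reads(reads):
    """Collapse identical reads and count how often each one occurs.

    `reads` is any iterable of sequence strings (hypothetical toy input here).
    """
    return Counter(reads)

# Toy input: six reads, three distinct sequences.
reads = ["ACGT", "ACGT", "TTGA", "ACGA", "TTGA", "ACGT"]
counts = count_reads(reads)

# List each distinct sequence with its count, most frequent first.
for seq, n in counts.most_common():
    print(seq, n)
```

The result is what I would expect from unique() plus a count column: one entry per distinct sequence, together with its number of occurrences.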
Also, if I want to cluster these 100M reads by similarity, for example an edit distance (or some other distance) <= 2, is there a function or package that can be used?
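To illustrate what I mean by clustering at edit distance <= 2, here is a rough sketch on toy data. It is written in Python only for illustration; `edit_distance` and `greedy_cluster` are hypothetical names, and the greedy "first representative within the threshold" rule is just one assumed way of defining clusters. It compares every read against every cluster representative, so it only shows the criterion and would not be practical for 100M reads.

```python
def edit_distance(a, b):
    """Levenshtein distance by dynamic programming, keeping one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def greedy_cluster(seqs, max_dist=2):
    """Put each sequence into the first cluster whose representative is within max_dist."""
    clusters = []  # list of (representative, members) pairs
    for s in seqs:
        for rep, members in clusters:
            if edit_distance(rep, s) <= max_dist:
                members.append(s)
                break
        else:
            # No existing representative is close enough: start a new cluster.
            clusters.append((s, [s]))
    return clusters

# Toy input: five distinct reads (hypothetical).
toy_reads = ["ACGTACGT", "ACGTACGA", "ACGTTCGA", "TTTTGGGG", "TTTTGGGC"]
clusters = greedy_cluster(toy_reads, max_dist=2)
for rep, members in clusters:
    print(rep, members)
```

Running this on the duplicate-collapsed sequences rather than on the raw reads would reduce the number of comparisons, but the cost still grows with the product of reads and clusters.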
Thank you!
Xiaohui
    
    