[BioC] Biostring: print sequence alignment to file
Martin Preusse
martin.preusse at googlemail.com
Thu Jul 19 13:30:04 CEST 2012
Hi Hervé,
thanks! This looks great. I have a question though: How can I install the latest version of Biostrings?
When I install with:
source("http://bioconductor.org/biocLite.R")
biocLite("Biostrings")
I get Version 2.22.0 … ;)
Cheers
Martin
Am Donnerstag, 19. Juli 2012 um 08:10 schrieb Hervé Pagès:
> Hi Martin,
>
> On 06/22/2012 12:09 PM, Hervé Pagès wrote:
> > Hi Martin,
> >
> > On 06/14/2012 06:55 AM, Martin Preusse wrote:
> > > Hi guys,
> > >
> > > anything new on the sequence output? Maybe I missed something :)
> > > please tell me if you need testing etc.
> >
> >
> >
> > Still on my list. Will work on this in the next couple of weeks. I'll
> > let you know. Thanks for the reminder.
>
>
>
> There is now a writePairwiseAlignments() function (in Biostrings 2.25.8)
> for doing this. It produces a file in the "pair" format (as described on
> the EMBOSS website, at the URL you sent earlier):
>
> > library(Biostrings)
> > pattern <- DNAString("CGTACGTAACGTTCGT")
> > subject <- DNAString("CGTCGTCGTCCGTAA")
> > x1 <- pairwiseAlignment(pattern, subject)
>
>
> > x1
> Global PairwiseAlignmentsSingleSubject (1 of 1)
> pattern: [1] CGTACGTAACGTTCGT
> subject: [1] CGT-CGT--CGTCCGT
> score: -32.11822
>
> > writePairwiseAlignments(x1, block.width=10) # 'block.width'
> default is 50
> ########################################
> # Program: Biostrings (version 2.25.8), a Bioconductor package
> # Rundate: Wed Jul 18 23:05:46 2012
> ########################################
> #=======================================
> #
> # Aligned_sequences: 2
> # 1: P1
> # 2: S1
> # Matrix: NA
> # Gap_penalty: 14.0
> # Extend_penalty: 4.0
> #
> # Length: 18
> # Identity: 12/18 (66.7%)
> # Similarity: NA/18 (NA%)
> # Gaps: 5/18 (27.8%)
> # Score: -32.11822
> #
> #
> #=======================================
>
> P1 1 CGTACGTAAC 10
> ||| ||| |
> S1 1 CGT-CGT--C 7
>
> P1 11 GTTCGT-- 16
> || |||
> S1 8 GTCCGTAA 15
>
>
> #---------------------------------------
> #---------------------------------------
>
>
> Only lightly tested. Not necessarily very performant (no C code). Please
> have a look at the man page for some caveats (especially if you plan to
> use it on NON global alignments). Feedback welcome.
>
> Thanks,
> H.
>
> >
> > H.
> >
> > >
> > > Cheers
> > > Martin
> > >
> > >
> > > Am Samstag, 21. April 2012 um 11:55 schrieb Martin Preusse:
> > >
> > > > Hi Hervé,
> > > >
> > > > thanks for your help! If you need suggestions, help or testing, just
> > > > say the word.
> > > >
> > > > Will you implement the header also? If you do so, I would be thankful
> > > > for an option like "header=F" for the output.
> > > >
> > > >
> > > > Cheers
> > > > Martin
> > > >
> > > >
> > > > Am Samstag, 21. April 2012 um 02:12 schrieb Hervé Pagès:
> > > >
> > > > > Thanks Martin and Thomas for the useful feedback. The 'pair' and
> > > > > 'markx0' formats supported by Emboss seem indeed appropriate for
> > > > > printing the output of pairwiseAlignment() to a file. I'll add
> > > > > support for those 2 formats in Biostrings. Won't be before 1 week
> > > > > or 2 though...
> > > > >
> > > > > Cheers,
> > > > > H.
> > > > >
> > > > > On 04/18/2012 03:20 AM, Martin Preusse wrote:
> > > > > > Hi,
> > > > > >
> > > > > > I just found this function to print a pairwise alignments in
> > > > > > blocks. Doesn't add the match/mismatch indicators between
> > > > > > sequences, but might be a starting point:
> > > > > >
> > > > > > http://a-little-book-of-r-for-bioinformatics.readthedocs.org/en/latest/src/chapter4.html#viewing-a-long-pairwise-alignment
> > > > > >
> > > > > >
> > > > > >
> > > > > > Cheers
> > > > > > Martin
> > > > > >
> > > > > >
> > > > > >
> > > > > > Am Mittwoch, 18. April 2012 um 12:16 schrieb Martin Preusse:
> > > > > >
> > > > > > > Hi everybody,
> > > > > > >
> > > > > > > I think the output format depends on the purpose of the alignment.
> > > > > > >
> > > > > > > A pairwise sequence alignment is usually done to compare two
> > > > > > > sequences base by base. In my case, I compare sequencing results
> > > > > > > of cloned expression constructs with the desired sequence. Thus,
> > > > > > > the best output format would be "BLAST like".
> > > > > > >
> > > > > > > seq1: 1 ATCTGC 7
> > > > > > > | | | . . |
> > > > > > > seq2: 1 ATCAAC 7
> > > > > > >
> > > > > > > When doing MSA, most people might rather be interested in the
> > > > > > > consensus sequence. E.g. in the context of conservation between
> > > > > > > species.
> > > > > > >
> > > > > > > So write.PairwiseAlignedXStringSet() and write.MultipleAlignment()
> > > > > > > are quite different and BLAST doesn't make much sense for multiple
> > > > > > > alignments. This means it would be best to put the output in the
> > > > > > > PairwiseAlignment/MultipleAlignment and not to the XStringSet, right?
> > > > > > >
> > > > > > > This is an overview of sequence alignment formats used by EMBOSS:
> > > > > > > http://emboss.sourceforge.net/docs/themes/AlignFormats.html
> > > > > > >
> > > > > > > 'pair' or 'markx0' would be perfectly fine.
> > > > > > >
> > > > > > >
> > > > > > > Cheers
> > > > > > > Martin
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > Am Dienstag, 17. April 2012 um 22:13 schrieb Thomas Girke:
> > > > > > >
> > > > > > > > Hi Hervé,
> > > > > > > >
> > > > > > > > To me, the most basic and versatile MSA or pairwise alignment
> > > > > > > > format to output
> > > > > > > > to would be FASTA since it is compatible with almost any other
> > > > > > > > alignment
> > > > > > > > editing software. For text-based viewing purposes my preference
> > > > > > > > would be
> > > > > > > > to also output to a format similar to the one shown in the following
> > > > > > > > example. When there are only two sequences then one could show
> > > > > > > > instead
> > > > > > > > of a consensus line the pipe characters between the two sequences to
> > > > > > > > indicate identical residues which mimics the blast output. A more
> > > > > > > > standardized version of this pairwise alignment format can be found
> > > > > > > > here:
> > > > > > > > http://emboss.sourceforge.net/apps/cvs/emboss/apps/needle.html
> > > > > > > >
> > > > > > > > library(Biostrings)
> > > > > > > > p450<-
> > > > > > > > read.AAStringSet("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/Samples/p450.mul",
> > > > > > > > "fasta")
> > > > > > > >
> > > > > > > > StringSet2html<- function(msa=p450, file="p450.html", start=1,
> > > > > > > > end=length(p450[[1]]), counter=20, browser=TRUE, ...) {
> > > > > > > > if(class(msa)=="AAStringSet") msa<- AAStringSet(msa, start=start,
> > > > > > > > end=end)
> > > > > > > > if(class(msa)=="DNAStringSet") msa<- DNAStringSet(msa,
> > > > > > > > start=start, end=end)
> > > > > > > > msavec<- sapply(msa, toString)
> > > > > > > > offset<- (counter-1)-nchar(nchar(msavec[1]))
> > > > > > > > legend<- paste(paste(paste(paste(rep(" ", offset), collapse=""),
> > > > > > > > format(seq(0,
> > > > > > > > nchar(msavec[1]), by=counter)[-1])), collapse=""), collapse="")
> > > > > > > > consensus<- consensusString(msavec, ambiguityMap=".", ...)
> > > > > > > > msavec<- paste(msavec, rowSums(as.matrix(msa) != "-"), sep=" ")
> > > > > > > > msavec<- paste(format(c("", names(msa), "Consensus"),
> > > > > > > > justify="left"), c(legend, msavec,
> > > > > > > > consensus), sep=" ")
> > > > > > > > msavec<- c("<html><pre>", msavec,"</pre></html>")
> > > > > > > > writeLines(msavec, file)
> > > > > > > > if(browser==TRUE) { browseURL(file) }
> > > > > > > > }
> > > > > > > > StringSet2html(msa=p450, file="p450.html", start=1,
> > > > > > > > end=length(p450[[1]]), counter=20, browser=T, threshold=1.0)
> > > > > > > > StringSet2html(msa=p450, file="p450.html", start=450, end=470,
> > > > > > > > counter=20, browser=T, threshold=1.0)
> > > > > > > >
> > > > > > > >
> > > > > > > > Thomas
> > > > > > > >
> > > > > > > > On Tue, Apr 17, 2012 at 07:43:30PM +0000, Hervé Pagès wrote:
> > > > > > > > > Hi Thomas,
> > > > > > > > >
> > > > > > > > > On 04/17/2012 11:49 AM, Thomas Girke wrote:
> > > > > > > > > > What about providing an option in pairwiseAlignment to output
> > > > > > > > > > to the
> > > > > > > > > > MultipleAlignment class in Biostrings and then write the latter to
> > > > > > > > > > different alignment formats?
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Or we could provide coercion methods to switch between
> > > > > > > > > PairwiseAlignedXStringSet and MultipleAlignment.
> > > > > > > > >
> > > > > > > > > Anyway that kind of moves Martin's problem from having a
> > > > > > > > > write.PairwiseAlignedXStringSet() function that produces BLAST
> > > > > > > > > output
> > > > > > > > > to having a write.MultipleAlignment() function that produces BLAST
> > > > > > > > > output. For the specific case of BLAST output, would it make sense
> > > > > > > > > to support it for MultipleAlignment? Can someone point me to an
> > > > > > > > > example
> > > > > > > > > of such output? Or even better, to the specs of such format?
> > > > > > > > >
> > > > > > > > > Note that right now there is the write.phylip() function in
> > > > > > > > > Biostrings
> > > > > > > > > for writing a MultipleAlignment object to a file but the Phylip
> > > > > > > > > format
> > > > > > > > > looks very different from the BLAST output:
> > > > > > > > >
> > > > > > > > > hpages at latitude:~$ head -n 20 phylip_test.txt
> > > > > > > > > 9 2343
> > > > > > > > > Mask 0000000000 0000000000 0000000000 0000000000 0000000000
> > > > > > > > > Human -----TCCCG TCTCCGCAGC AAAAAAGTTT GAGTCGCCGC TGCCGGGTTG
> > > > > > > > > Chimp ---------- ---------- ---------- ---------- ----------
> > > > > > > > > Cow ---------- ---------- ---------- ---------- ----------
> > > > > > > > > Mouse ---------- ---------- --AAAAGTTG GAGTCTTCGC TTGAGAGTTG
> > > > > > > > > Rat ---------- ---------- ---------- ---------- ----------
> > > > > > > > > Dog ---------- ---------- ---------- ---------- ----------
> > > > > > > > > Chicken ---------- ----CGGCTC CGCAGCGCCT CACTCGCGCA GTCCCCGCGC
> > > > > > > > > Salmon GGGGGAGACT TCAGAAGTTG TTGTCCTCTC CGCTGATAAC AGTTGAGATG
> > > > > > > > >
> > > > > > > > > 0000000000 0000000000 0000000000 0001111111 1111111111
> > > > > > > > > CCAGCGGAGT CGCGCGTCGG GAGCTACGTA GGGCAGAGAA GTCA-TGGCT
> > > > > > > > > ---------- ---------- ---------- ---------- ---A-TGGCT
> > > > > > > > > ---------- ---------- ---------- ---GAGAGAA GTCA-TGGCT
> > > > > > > > > CCAGCGGAGT CGCGCGCCGA CAGCTACGCG GCGCAGA-AA GTCA-TGGCT
> > > > > > > > > ---------- ---------- ---------- ---------- ---A-TGGCT
> > > > > > > > > ---------- ---------- ---------- ---------- ---A-TGGCT
> > > > > > > > > AGGGCCGGGC AGAGGCGCAC GCAGCTCCCC GGGCGGCCCC GCTC-CAGCC
> > > > > > > > > CGCATATTAT TATTACCTTT AGGACAAGTT GAATGTGTTC GTCAACATCT
> > > > > > > > >
> > > > > > > > > Thanks!
> > > > > > > > > H.
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Thomas
> > > > > > > > > >
> > > > > > > > > > On Tue, Apr 17, 2012 at 05:59:24PM +0000, Hervé Pagès wrote:
> > > > > > > > > > > Hi Martin,
> > > > > > > > > > >
> > > > > > > > > > > On 04/16/2012 04:06 AM, Martin Preusse wrote:
> > > > > > > > > > > > Hi Charles,
> > > > > > > > > > > >
> > > > > > > > > > > > thanks! Your solution allows to print the two alignment
> > > > > > > > > > > > strings separately.
> > > > > > > > > > > >
> > > > > > > > > > > > I was thinking of an output as generated by alignment tools:
> > > > > > > > > > > >
> > > > > > > > > > > > AGT-TCTAT
> > > > > > > > > > > > | | | | | | | | |
> > > > > > > > > > > > AGTATCTAT
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > This looks like BLAST output. Is this what you have in mind?
> > > > > > > > > > > Note that
> > > > > > > > > > > there are many alignment tools and many ways to output the
> > > > > > > > > > > result to a
> > > > > > > > > > > file. I'm not really familiar with the BLAST output format. Is it
> > > > > > > > > > > specified somewhere? Would that make sense to add something
> > > > > > > > > > > like a
> > > > > > > > > > > write.PairwiseAlignedXStringSet() function to Biostrings for
> > > > > > > > > > > writing
> > > > > > > > > > > the result of pairwiseAlignment() to a file? We could do this and
> > > > > > > > > > > support the BLAST format if that's a commonly used format.
> > > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > > H.
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > For this I would have to write a function to output the
> > > > > > > > > > > > strings in blocks of e.g. 60 nucleotides, right?
> > > > > > > > > > > >
> > > > > > > > > > > > Cheers
> > > > > > > > > > > > Martin
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Am Freitag, 13. April 2012 um 19:21 schrieb Chu, Charles:
> > > > > > > > > > > >
> > > > > > > > > > > > > write.XStringSet
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > _______________________________________________
> > > > > > > > > > > > Bioconductor mailing list
> > > > > > > > > > > > Bioconductor at r-project.org (mailto:Bioconductor at r-project.org)
> > > > > > > > > > > > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > > > > > > > > > > > Search the archives:
> > > > > > > > > > > > http://news.gmane.org/gmane.science.biology.informatics.conductor
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > --
> > > > > > > > > > > Hervé Pagès
> > > > > > > > > > >
> > > > > > > > > > > Program in Computational Biology
> > > > > > > > > > > Division of Public Health Sciences
> > > > > > > > > > > Fred Hutchinson Cancer Research Center
> > > > > > > > > > > 1100 Fairview Ave. N, M1-B514
> > > > > > > > > > > P.O. Box 19024
> > > > > > > > > > > Seattle, WA 98109-1024
> > > > > > > > > > >
> > > > > > > > > > > E-mail: hpages at fhcrc.org (mailto:hpages at fhcrc.org)
> > > > > > > > > > > Phone: (206) 667-5791
> > > > > > > > > > > Fax: (206) 667-1319
> > > > > > > > > > >
> > > > > > > > > > > _______________________________________________
> > > > > > > > > > > Bioconductor mailing list
> > > > > > > > > > > Bioconductor at r-project.org (mailto:Bioconductor at r-project.org)
> > > > > > > > > > > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > > > > > > > > > > Search the archives:
> > > > > > > > > > > http://news.gmane.org/gmane.science.biology.informatics.conductor
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Hervé Pagès
> > > > > > > > >
> > > > > > > > > Program in Computational Biology
> > > > > > > > > Division of Public Health Sciences
> > > > > > > > > Fred Hutchinson Cancer Research Center
> > > > > > > > > 1100 Fairview Ave. N, M1-B514
> > > > > > > > > P.O. Box 19024
> > > > > > > > > Seattle, WA 98109-1024
> > > > > > > > >
> > > > > > > > > E-mail: hpages at fhcrc.org (mailto:hpages at fhcrc.org)
> > > > > > > > > Phone: (206) 667-5791
> > > > > > > > > Fax: (206) 667-1319
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Hervé Pagès
> > > > >
> > > > > Program in Computational Biology
> > > > > Division of Public Health Sciences
> > > > > Fred Hutchinson Cancer Research Center
> > > > > 1100 Fairview Ave. N, M1-B514
> > > > > P.O. Box 19024
> > > > > Seattle, WA 98109-1024
> > > > >
> > > > > E-mail: hpages at fhcrc.org (mailto:hpages at fhcrc.org)
> > > > > Phone: (206) 667-5791
> > > > > Fax: (206) 667-1319
> > > >
> > >
> >
>
>
>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org (mailto:hpages at fhcrc.org)
> Phone: (206) 667-5791
> Fax: (206) 667-1319
More information about the Bioconductor
mailing list