[BioC] List all alignments in Biostrings
Hervé Pagès
hpages at fhcrc.org
Tue Mar 25 21:06:14 CET 2014
On 03/25/2014 11:24 AM, Alpesh Querer wrote:
> Thank you Hervé . so if I want to determine to determine number of
> local alignments between two strings no less than length 4, is that
> possible in biostrings?
By "local alignments no less than length 4" do you mean "common
substrings of length >= 4"?
> The alternative is create sliding window fragments from one sequence
> and use matchPattern ?
> subject1: TTAAACGTAAAATGAATAAGCCT
> subject:2 GGAAACGTGAATTGAATAAGCG
Yes you could try doing something like that. Create the views (aka
sliding window) on the first sequence:
s1 <- DNAString("TTAAACGTAAAATGAATAAGCCT")
s1views <- Views(s1, IRanges(seq_len(length(s1)-3), width=4))
Then use matchPDict() instead of matchPattern(). This will be faster
than calling matchPattern() on each view:
m <- matchPDict(PDict(s1views), s2)
start_in_s1 <- rep(start(s1views), elementLengths(m))
start_in_s2 <- unlist(start(m))
seq <- as.character(s1views[start_in_s1])
hits0 <- data.frame(start_in_s1, start_in_s2, width=4L, seq)
Then:
> hits0
start_in_s1 start_in_s2 width seq
1 3 3 4 AAAC
2 4 4 4 AACG
3 5 5 4 ACGT
4 13 8 4 TGAA
5 13 13 4 TGAA
6 14 9 4 GAAT
7 14 14 4 GAAT
8 15 15 4 AATA
9 16 16 4 ATAA
10 17 17 4 TAAG
11 18 18 4 AAGC
Hits that are at consecutive positions in both 'start_in_s1' and
'start_in_s2' should probably be merged in a single longer hit
(they clearly correspond to the same hit). This can be done with
something like:
mergeHits <- function(hits)
{
m <- IRanges:::matchIntegerPairs(hits$start_in_s1 - 1L,
hits$start_in_s2 - 1L,
hits$start_in_s1,
hits$start_in_s2)
merge_idx <- which(!is.na(m))
if (length(merge_idx) == 0L)
return(hits)
## This for loop can be very slow if there are a lot of hits
## to merge. There must be a better way!
for (i in merge_idx) {
mi <- m[i]
mj <- m[mi]
if (!is.na(mj))
m[i] <- mj
}
hits$width <- hits$width + tabulate(m, nbins=length(m))
hits[-merge_idx, c("start_in_s1", "start_in_s2", "width")]
}
hits <- mergeHits(hits0)
Then:
> hits
start_in_s1 start_in_s2 width
1 3 3 6
4 13 8 5
5 13 13 9
Putting back the hit sequences (they can be extracted either from
s1 or s2, it doesn't matter, that should give us exactly the same
sequences):
hits$seq <- as.character(Views(s1, IRanges(hits$start_in_s1,
width=hits$width)))
Then:
> hits
start_in_s1 start_in_s2 width seq
1 3 3 6 AAACGT
4 13 8 5 TGAAT
5 13 13 9 TGAATAAGC
HTH,
H.
>
>
> On Tue, Mar 25, 2014 at 1:06 PM, Hervé Pagès <hpages at fhcrc.org
> <mailto:hpages at fhcrc.org>> wrote:
>
> Hi Al,
>
>
> On 03/25/2014 09:31 AM, Alpesh Querer wrote:
>
> Hello,
>
> I understand the pairwiseAlignment() returns the (first) best
> match only,
> is there a way for it to return all alignments for 2 sequences
> both local
> and global?
>
>
> Short answer is no.
>
> Long answer: I'm not sure what you'd like to get in case of a global
> alignment ("global" is short for "global-global").
> If you're doing "global-local" alignment, note that pairwiseAlignment()
> returns the *last* best match, not the first:
>
> pattern <- DNAString("AAA")
> subject <- DNAString("TTAAACGTAAAATGAAT")
>
> Then:
>
> > pairwiseAlignment(pattern, subject, type="global-local")
> Global-Local PairwiseAlignmentsSingleSubjec__t (1 of 1)
> pattern: [1] AAA
> subject: [10] AAA
> score: 5.945268
>
> You can use matchPattern() to get all the matches that satisfy
> some criteria (but this criteria must be expressed in terms of
> edit distance):
>
> > matchPattern(pattern, subject, max.mismatch=1, with.indels=TRUE)
> Views on a 17-letter DNAString subject
> subject: TTAAACGTAAAATGAAT
> views:
> start end width
> [1] 3 5 3 [AAA]
> [2] 9 11 3 [AAA]
> [3] 10 12 3 [AAA]
> [4] 15 16 2 [AA]
>
> This doesn't give you a score though nor does it guarantee that you'll
> get a match. However, if the returned Views object is not empty, then
> it's guaranteed to contain all the "best" matches. Even though the Views
> object can be post-processed to keep the best matches only, there is no
> way to ask matchPattern() something like "give me the best matches
> only".
>
> HTH,
> H.
>
>
> Thanks,
> Al
>
> [[alternative HTML version deleted]]
>
> _________________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
> https://stat.ethz.ch/mailman/__listinfo/bioconductor
> <https://stat.ethz.ch/mailman/listinfo/bioconductor>
> Search the archives:
> http://news.gmane.org/gmane.__science.biology.informatics.__conductor
> <http://news.gmane.org/gmane.science.biology.informatics.conductor>
>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
> Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
> Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioconductor
mailing list