[BioC] GappedAlignmentPairs requests
Hervé Pagès
hpages at fhcrc.org
Wed Jul 11 01:22:46 CEST 2012
Hi Michael,
On 07/10/2012 03:45 AM, Michael Lawrence wrote:
>
>
> On Mon, Jul 9, 2012 at 11:47 PM, Hervé Pagès <hpages at fhcrc.org
> <mailto:hpages at fhcrc.org>> wrote:
>
> Hi Michael,
>
>
> On 06/29/2012 05:06 AM, Michael Lawrence wrote:
>
> Hi (Herve),
>
> I have been playing around with GappedAlignmentPairs and found
> it pretty
> useful. Some things that would be nice:
>
> 1) A method for getting the observed fragment length (would need
> to be NA
> when pairs are on different chromosomes). This should at least
> have an
> option (if not always) to exclude the N cigar runs from the
> calculation. Or
> it could just take it from the TLEN (isize) column in the BAM
> file (which
> at least in the case of gsnap excludes the Ns).
>
>
> From the SAM Spec:
>
> TLEN: signed observed Template LENgth. If all segments are mapped to
> the same reference, the unsigned observed template length equals the
> number of bases from the leftmost mapped base to the rightmost mapped
> base. [...] It is set as 0 for single-segment template or when the
> information is unavailable.
>
> 5 notes:
>
> (1) It seems that most of the time this information is not available
> in the BAM files produced by the aligners (at least not in those
> I've seen so far).
>
> (2) I wonder if the above definition really makes sense for a
> paired-end read with the 2 ends aligned to the *same* strand.
> It's actually not clear to me how to interpret such a situation
> and why the aligner is producing this kind of paired alignments.
> Seems to reveal a template coming from a transcript that mixes
> exons from both strands.
>
>
>
> Yes, for non-concordant pairs, like ends that do not map to opposite
> strands, or that map to different chromosomes or very far away on the
> same chromosome, this would not be easy to calculate. Presumably, there
> is some sort of genomic rearrangement or trans-splicing, and the precise
> break-point would need to be determined.
>
> (3) Taking "the number of bases from the leftmost mapped base
> to the rightmost mapped base" is ok most of the times but is
> questionable in some situations e.g. when the first segment
> is not mapped upstream of the last segment. I'd rather set
> TLEN to NA in that case too, like when the 2 segments are not
> mapped to the same reference, or when they are mapped to the
> same strand.
>
>
> This makes sense to me.
>
> (4) The above is just a definition so it is what it is (and is not
> inherently correct or incorrect) but it seems to me that
> excluding the N gaps would be closer to the intuitive notion
> of "observed template length".
>
> (5) The above definition ignores soft clipping.
>
>
> Right. The definition is not that helpful, and Tom has already decided
> to violate it with gsnap, which counts the soft-clipped regions and
> discounts the N regions when populating the TLEN field. So we're
> currently happy just taking TLEN from the BAM file.
>
> So here is a proposal that ignores the N gaps but takes into account
> soft clipping. 'galp' is GappedAlignmentPairs object. Note that we
> don't need to deal with the situation where the 2 ends are not
> mapped to the same reference, or are mapped to the same strand, because
> those pairs are discarded (with a warning) when the GappedAlignmentPairs
> object is made (usually with readGappedAlignmentPairs()):
>
> otlen <- function(galp)
> {
> lgal <- left(galp)
> rgal <- right(galp)
> lend <- end(lgal)
> rstart <- start(rgal)
> ans <- qwidth(lgal) + qwidth(rgal) + rstart - lend - 1L
>
> ## Set to NA when the first and last ends are not in an
> ## upstream/downstream configuration:
> not_upstream_downstream <- rstart <= lend
> ans[not_upstream_downstream] <- NA_integer_
>
> ans
> }
>
> If we want to be a little less conservative about the first and
> last ends being in an upstream/downstream configuration, we
> can replace the computation of 'not_upstream_downstream' with:
>
> lstart <- start(lgal)
> rend <- end(rgal)
> not_upstream_downstream <- rstart < lstart | rend < lend
>
>
>
> 2) A method called "introns" or something that returns the N
> regions as a
> GRangesList, as a complement to grglist().
>
>
> The special gap between the 2 ends is a little bit problematic.
> How should we deal with it given that (1) it has a different
> meaning than the N gaps (i.e. it does not in general correspond to an
> intron, only by chance), and (2) it's not always here (e.g. when
> the first and last ends are not in an upstream/downstream
> configuration)? Of course (2) could always be handled by removing
> the problematic alignment pairs.
>
>
> Well, my request was to discard the inter-read gap, so that I would just
> have the splices as called by the aligner.
I see. Then this can be done with something like this:
## First we need an optimized mendoapply(c, x, y) for CompressedList
## objects (probably worth adding to IRanges, could also be called
## elementCombine() and take an arbitrary number of objects).
pcombine <- function(x, y)
{
if (!is(x, "CompressedList") ||
class(x) != class(y) ||
length(x) != length(y))
stop("'x' and 'y' must be CompressedList objects ",
"of the same class and length")
tmp <- c(x, y)[GenomicRanges:::.makePickupIndex(length(x))]
ans <- GenomicRanges:::.shrinkByHalf(tmp)
elementMetadata(ans) <- NULL
ans
}
## Then the introns() function itself for GappedAlignmentPairs.
introns <- function(x)
{
if (!is(x, "GappedAlignmentPairs"))
stop("'x' must be a GappedAlignmentPairs object")
## Extract introns (i.e. N gaps) from the first ends.
Fgal <- first(x)
Fgrl <- grglist(Fgal, order.as.in.query=TRUE)
Fintrons <- psetdiff(granges(Fgal), Fgrl)
## Extract introns (i.e. N gaps) from the last ends.
Lgal <- last(x, invert.strand=TRUE)
Lgrl <- grglist(Lgal, order.as.in.query=TRUE)
Lintrons <- psetdiff(granges(Lgal), Lgrl)
## Combine the introns from first and last ends into a
## single GRangesList object.
ans <- pcombine(Fintrons, Lintrons)
names(ans) <- names(x)
ans
}
Try it:
my_introns <- introns(galp)
stopifnot(identical(elementLengths(my_introns), ngap(first(galp)) +
ngap(last(galp))))
> It would be nice to have
> another function that would return the inter-read gap. I actually have
> functions that do all these things, but they're not very elegant or
> robust to all of these situations.
>
>
> Such a method would also be nice
> on GappedAlignments.
^^^^^^^^^^^^^^^^
>
>
> This one is easier ('gal' being a GappedAlignments object):
^^^^^^^^^^^^^^^^
>
> psetdiff(granges(gal), grglist(gal))
>
> Also I think you actually sent us a "gaps" method for GRangesList
> recently that we are about to add to the GenomicRanges package.
> Once we have it, the user will be able to do:
>
> gaps(grglist(gal))
>
> as an equivalent way to do the above.
>
>
> Right, but as I said, we want to ignore the inter-read gap, which takes
> a bit more work. Right now, I do the above and then setdiff off the
> inter-read gap.
The above is for a GappedAlignments object so there is no inter-read
gap.
HTH,
H.
>
> Thanks,
>
> H.
>
>
> Thanks,
> Michael
>
> [[alternative HTML version deleted]]
>
> _________________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
> https://stat.ethz.ch/mailman/__listinfo/bioconductor
> <https://stat.ethz.ch/mailman/listinfo/bioconductor>
> Search the archives:
> http://news.gmane.org/gmane.__science.biology.informatics.__conductor
> <http://news.gmane.org/gmane.science.biology.informatics.conductor>
>
>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
> Phone: (206) 667-5791
> Fax: (206) 667-1319
>
>
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioconductor
mailing list