[BioC] GappedAlignmentPairs requests
Hervé Pagès
hpages at fhcrc.org
Tue Jul 10 08:47:24 CEST 2012
Hi Michael,
On 06/29/2012 05:06 AM, Michael Lawrence wrote:
> Hi (Herve),
>
> I have been playing around with GappedAlignmentPairs and found it pretty
> useful. Some things that would be nice:
>
> 1) A method for getting the observed fragment length (would need to be NA
> when pairs are on different chromosomes). This should at least have an
> option (if not always) to exclude the N cigar runs from the calculation. Or
> it could just take it from the TLEN (isize) column in the BAM file (which
> at least in the case of gsnap excludes the Ns).
From the SAM Spec:
TLEN: signed observed Template LENgth. If all segments are mapped to
the same reference, the unsigned observed template length equals the
number of bases from the leftmost mapped base to the rightmost mapped
base. [...] It is set as 0 for single-segment template or when the
information is unavailable.
5 notes:
(1) It seems that most of the time this information is not available
in the BAM files produced by the aligners (at least not in those
I've seen so far).
(2) I wonder if the above definition really makes sense for a
paired-end read with the 2 ends aligned to the *same* strand.
It's actually not clear to me how to interpret such a situation
and why the aligner is producing this kind of paired alignments.
Seems to reveal a template coming from a transcript that mixes
exons from both strands.
(3) Taking "the number of bases from the leftmost mapped base
to the rightmost mapped base" is ok most of the times but is
questionable in some situations e.g. when the first segment
is not mapped upstream of the last segment. I'd rather set
TLEN to NA in that case too, like when the 2 segments are not
mapped to the same reference, or when they are mapped to the
same strand.
(4) The above is just a definition so it is what it is (and is not
inherently correct or incorrect) but it seems to me that
excluding the N gaps would be closer to the intuitive notion
of "observed template length".
(5) The above definition ignores soft clipping.
So here is a proposal that ignores the N gaps but takes into account
soft clipping. 'galp' is GappedAlignmentPairs object. Note that we
don't need to deal with the situation where the 2 ends are not
mapped to the same reference, or are mapped to the same strand, because
those pairs are discarded (with a warning) when the GappedAlignmentPairs
object is made (usually with readGappedAlignmentPairs()):
otlen <- function(galp)
{
lgal <- left(galp)
rgal <- right(galp)
lend <- end(lgal)
rstart <- start(rgal)
ans <- qwidth(lgal) + qwidth(rgal) + rstart - lend - 1L
## Set to NA when the first and last ends are not in an
## upstream/downstream configuration:
not_upstream_downstream <- rstart <= lend
ans[not_upstream_downstream] <- NA_integer_
ans
}
If we want to be a little less conservative about the first and
last ends being in an upstream/downstream configuration, we
can replace the computation of 'not_upstream_downstream' with:
lstart <- start(lgal)
rend <- end(rgal)
not_upstream_downstream <- rstart < lstart | rend < lend
>
> 2) A method called "introns" or something that returns the N regions as a
> GRangesList, as a complement to grglist().
The special gap between the 2 ends is a little bit problematic.
How should we deal with it given that (1) it has a different
meaning than the N gaps (i.e. it does not in general correspond to an
intron, only by chance), and (2) it's not always here (e.g. when
the first and last ends are not in an upstream/downstream
configuration)? Of course (2) could always be handled by removing
the problematic alignment pairs.
> Such a method would also be nice
> on GappedAlignments.
This one is easier ('gal' being a GappedAlignments object):
psetdiff(granges(gal), grglist(gal))
Also I think you actually sent us a "gaps" method for GRangesList
recently that we are about to add to the GenomicRanges package.
Once we have it, the user will be able to do:
gaps(grglist(gal))
as an equivalent way to do the above.
Thanks,
H.
>
> Thanks,
> Michael
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioconductor
mailing list