[BioC] findOverlaps method in GenomicRanges not supporting type="equal" for GRangesList, GRangesList?
Hervé Pagès
hpages at fhcrc.org
Fri Nov 22 22:03:34 CET 2013
Hi Nico,
On 11/22/2013 06:36 AM, Nicolas Delhomme wrote:
> Hej Michael, Hervé!
>
> I have to admit that I’m not exactly following the details of your discussion here. What would appear intuitive for me for a findOverlaps function on GRangesList, GRangesList is that it returns a list of findOverlaps on GRanges,GRanges. I can see difficulties in determining the concordance between the GRanges of the two GRangesList though. And I don’t have the overview to decide whether such a behaviour would be consistent, but that’s naively what I would expect.
Typically when 'query' and 'subject' are GRangesList, one is
representing alignments (e.g. single-end with gaps or paired-end,
thus more than 1 range is needed to represent an alignment), and
the other one is representing a set of compound genomic features
e.g. exons grouped by transcript or by gene.
findOverlaps() then just returns yes/no answers to the questions
"do query[[i]] and subject[[j]] overlap?" for all the possible (i, j)
combinations. It doesn't return any detailed information about which
ranges in GRanges objects query[[i]] and subject[[j]] have overlaps.
If a hit is reported between query[[i]] and subject[[j]], it just
means that at least one range in the former overlaps with one range
in the latter. For many analyses, this is enough information i.e.
knowing that there is a hit between a read and a transcript is enough.
For analyses that need to know a little bit more about the
characteristics of those hits, some post-processing of the Hits object
is generally required. That's what tools like summarizeOverlaps or
findSpliceOverlaps do. In your case, if you want a "list of
findOverlaps on GRanges,GRanges", it's also something that you could
obtained by post-processing the hit object:
ov <- findOverlaps(query, subject, ...)
mapply(findOverlaps, query[queryHits(ov)], subject[subjectHits(ov)])
It probably won't be very efficient but we could address this by
implementing a fast "parallel findOverlaps" (e.g. pfindOverlaps)
if there is interest in something like that.
But back to your original request of supporting type="equal" on
GRangesList objects. When you use the 'type' argument, that only
changes the criteria that findOverlaps uses for deciding whether
to report a hit or not. That doesn't change what is reported.
What is reported is still a Hits object answering the yes/no
answers. When type="equal" will be available, a hit between
query[[i]] and subject[[j]] will be reported only if the ranges
are the same in the 2 objects. In that case there doesn't seem to
be a lot of value in trying to generate a "list of findOverlaps on
GRanges,GRanges" because you already know what you need to know.
So what would be more useful: (1) a fast pfindOverlaps, (2) supporting
type="equal" on GRangesList, (3) both? Knowing a little bit more
about what you are trying to do would help us decide.
FWIW let me mention another way of post-processing a Hits object.
encodeOverlaps() can be used on the Hits object to produce "overlap
encodings". These are small strings that describe how the individual
ranges in the 2 GRanges objects involved in a hit are positioned
with respect to each other. The idea is a little bit the same as
with the CIGAR encodings that provide details about how reads align
to the reference. See the OverlapEncoding vignette in the GenomicRanges
package for more info (it's a low level tool and it can be a little
bit tricky to use, also the vignette is probably too technical and
would need to be improved).
Cheers,
H.
>
> Anyway, just my 2 cts. Thanks for taking a look.
>
> Cheers,
>
> Nico
>
> ---------------------------------------------------------------
> Nicolas Delhomme
>
> Nathaniel Street Lab
> Department of Plant Physiology
> Umeå Plant Science Center
>
> Tel: +46 90 786 7989
> Email: nicolas.delhomme at plantphys.umu.se
> SLU - Umeå universitet
> Umeå S-901 87 Sweden
> ---------------------------------------------------------------
>
> On 21 Nov 2013, at 19:43, Michael Lawrence <lawrence.michael at gene.com> wrote:
>
>> So I've checked into devel a match,GRangesList,GRangesList. This allows findMatches() to return what you want. There is a question though before this is approved: does it make sense for match() to act like findOverlaps and consider each GRanges atomically (one returned index per GRanges) or should match behave as it does other Lists and return an IntegerList, with a value per range, grouped by the top-level elements. If we decide on the latter, then the method I wrote needs to be removed and the implementation moved to the "equals" mode in findOverlaps. Either way, findOverlaps(type="equals") should be made to work.
>>
>> Michael
>>
>>
>> On Thu, Nov 21, 2013 at 8:13 AM, Nicolas Delhomme <nicolas.delhomme at umu.se> wrote:
>> Thanks!
>> ---------------------------------------------------------------
>> Nicolas Delhomme
>>
>> Nathaniel Street Lab
>> Department of Plant Physiology
>> Umeå Plant Science Center
>>
>> Tel: +46 90 786 7989
>> Email: nicolas.delhomme at plantphys.umu.se
>> SLU - Umeå universitet
>> Umeå S-901 87 Sweden
>> ---------------------------------------------------------------
>>
>> On 21 Nov 2013, at 17:06, Michael Lawrence <lawrence.michael at gene.com> wrote:
>>
>>> I will work on this today.
>>>
>>> Michael
>>>
>>>
>>> On Thu, Nov 21, 2013 at 4:43 AM, Nicolas Delhomme <nicolas.delhomme at umu.se> wrote:
>>> Hej Bioc!
>>>
>>> When I try to find “equal” ranges from two GRangesList object, I get the following error:
>>>
>>>> findOverlaps(query=grng.def,subject=grng.mod,type="equal")
>>> Error in match.arg(type) :
>>> 'arg' should be one of “any”, “start”, “end”, “within”
>>>
>>> Isn’t type=“equal” supported for the GRangesList, GRangesList signature?
>>>
>>> Cheers,
>>>
>>> Nico
>>>
>>> sessionInfo()
>>> R version 3.0.2 (2013-09-25)
>>> Platform: x86_64-apple-darwin13.0.0 (64-bit)
>>>
>>> locale:
>>> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>>>
>>> attached base packages:
>>> [1] parallel stats graphics grDevices utils datasets methods base
>>>
>>> other attached packages:
>>> [1] easyRNASeq_1.8.2 ShortRead_1.20.0 Rsamtools_1.14.1 GenomicRanges_1.14.3 DESeq_1.14.0 lattice_0.20-24 locfit_1.5-9.1
>>> [8] Biostrings_2.30.1 XVector_0.2.0 IRanges_1.20.5 edgeR_3.4.0 limma_3.18.3 biomaRt_2.18.0 Biobase_2.22.0
>>> [15] genomeIntervals_1.18.0 BiocGenerics_0.8.0 intervals_0.14.0
>>>
>>> loaded via a namespace (and not attached):
>>> [1] annotate_1.40.0 AnnotationDbi_1.24.0 bitops_1.0-6 DBI_0.2-7 genefilter_1.44.0 geneplotter_1.40.0 grid_3.0.2 hwriter_1.3
>>> [9] latticeExtra_0.6-26 LSD_2.5 RColorBrewer_1.0-5 RCurl_1.95-4.1 RSQLite_0.11.4 splines_3.0.2 stats4_3.0.2 survival_2.37-4
>>> [17] tools_3.0.2 XML_3.98-1.1 xtable_1.7-1 zlibbioc_1.8.0
>>>
>>>
>>> ---------------------------------------------------------------
>>> Nicolas Delhomme
>>>
>>> Nathaniel Street Lab
>>> Department of Plant Physiology
>>> Umeå Plant Science Center
>>>
>>> Tel: +46 90 786 7989
>>> Email: nicolas.delhomme at plantphys.umu.se
>>> SLU - Umeå universitet
>>> Umeå S-901 87 Sweden
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>
>>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioconductor
mailing list