[BioC] countMatches() (was: table for GenomicRanges)

Sat Jan 5 05:34:10 CET 2013

Yes 'peaks %in% genes' is cute and was probably doing the right thing
for most users (although not all). But 'exons %in% genes' is cute too
and was probably doing the wrong thing for all users. Advanced users
like you guys would have no problem switching to

   !is.na(findOverlaps(peaks, genes, type="within", select="any"))

or

   !is.na(findOverlaps(peaks, genes, type="equal", select="any"))

in case 'peaks %in% genes' was not doing exactly what you wanted,
but most users would not find this particularly friendly. Even
worse, some users probably didn't realize that 'peaks %in% genes'
was not doing exactly what they thought it did because "peaks in
genes" in English suggests that the peaks are within the genes,
but it's not what 'peaks %in% genes' does.

Having overlapsAny(), with exactly the same extra arguments as
countOverlaps() and subsetByOverlaps() (i.e. 'maxgap', 'minoverlap',
'type', 'ignore.strand'), all of them documented (and with most
users more or less familiar with them already) has the virtue to
expose the user to all the options from the very start, and to
help him/her make the right choice. Of course there will be users
that don't want or don't have the time to read/think about all the
options. Not a big deal: they'll just do 'overlapsAny(query, subject)',
which is not a lot more typing than 'query %in% subject', especially
if they use tab completion.

It's true that it's more common to ask questions about overlap than
about equality but there are some use cases for the latter (as the
original thread shows). Until now, when you had such a use case, you
could not use match() or %in%, which would have been the natural things
to use, because they got hijacked to do something else, and you were
left with nothing. Not a satisfying situation. So at a minimum, we
needed to restore the true/real/original semantic of match() to do
"equality" instead of "overlap". But it's hard to do this for match()
and not do it for %in% too. For more than 99% of R users, %in% is
just a simple wrapper for 'match(x, table, nomatch = 0) > 0' (this
is how it has been documented and implemented in base R for many
years). Not maintaining this relationship between %in% and match()
would only cause grief and frustration to newcomers to Bioconductor.

H.

On 01/04/2013 03:32 PM, Cook, Malcolm wrote:
> Hiya again,
>
> I am definitely a late comer to BioC, so I definitely easily defer to
> the tide of history.
>
> But I do think you miss my point Michael about the proposed change
> making the relationship between %in% and match for {G,I}Ranges{List}
> mimic that between other vectors, and I do think that changing the API
> would make other late-comers take to BioC easier/faster.
>
> That said, I NEVER use %in% so I really have no stake in the matter, and
> I DEFINITELY appreciate the argument to not changing the API just for
> sematic sweetness.
>
> That that said, Herve is _/so good/_ about deprecations and warnings
> that make such changes fairly easily digestible.
>
> That that that.... enough.... I bow out of this one....!!!!
>
> Always learning and Happy New Year to all lurkers,
>
> ~Malcolm
>
> *From:*Michael Lawrence [mailto:lawrence.michael at gene.com]
> *Sent:* Friday, January 04, 2013 5:11 PM
> *To:* Cook, Malcolm
> *Cc:* Sean Davis; Michael Lawrence; Hervé Pagès (hpages at fhcrc.org); Tim
> Triche, Jr.; Vedran Franke; bioconductor at r-project.org
> *Subject:* Re: [BioC] countMatches() (was: table for GenomicRanges)
>
> On Fri, Jan 4, 2013 at 1:56 PM, Cook, Malcolm <MEC at stowers.org
> <mailto:MEC at stowers.org>> wrote:
>
> Hiya,
>
> For what it is worth...
>
> I think the change to %in% is warranted.
>
> If I understand correctly, this change restores the relationship between
> the semantics of `%in` and the semantics of `match`.
>
>  From the docs:
>
>    '"%in%" <- function(x, table) match(x, table, nomatch = 0) > 0'
>
> Herve's change restores this relationship.
>
>
> match and %in% were initially consistent (both considering any overlap);
> Herve has changed both of them together. The whole idea behind IRanges
> is that ranges are special data types with special semantics. We have
> reimplemented much of the existing R vector API using those semantics;
> this extends beyond match/%in%. I am hesitant about making such sweeping
> changes to the API so late in the life-cycle of the package. There was a
> feature request for a way to count identical ranges in a set of ranges.
> Let's please not get carried away and start redesigning the API for this
> one, albeit useful, request. There are all sorts of inconsistencies in
> the API, and many of them were conscious decisions that considered
> practical use cases.
>
> Michael
>
>
>     Herve, I suspect you were you as a result able to completely drop
>     all the `%in%,BiocClass1,BiocClass2` definitions and depend upon
>     base::%in%
>
>     Am I right?
>
>     If so, may I suggest that Herve stay the course, with the addition of
>        '"%ol%" <- function(a, b) findOverlaps(a, b, maxgap=0L,
>     minoverlap=1L, type='any', select='all') > 0'
>
>     This would provide a perspicacious idiom, thereby optimizing the API
>     for Michaels observed common use case.
>
>     Just sayin'
>
>     ~Malcolm
>
>
>       .-----Original Message-----
>       .From: bioconductor-bounces at r-project.org
>     <mailto:bioconductor-bounces at r-project.org>
>     [mailto:bioconductor-bounces at r-project.org
>     <mailto:bioconductor-bounces at r-project.org>] On Behalf Of Sean Davis
>       .Sent: Friday, January 04, 2013 3:37 PM
>       .To: Michael Lawrence
>       .Cc: Tim Triche, Jr.; Vedran Franke; bioconductor at r-project.org
>     <mailto:bioconductor at r-project.org>
>       .Subject: Re: [BioC] countMatches() (was: table for GenomicRanges)
>       .
>       .On Fri, Jan 4, 2013 at 4:32 PM, Michael Lawrence
>       .<lawrence.michael at gene.com <mailto:lawrence.michael at gene.com>> wrote:
>       .> The change to the behavior of %in% is a pretty big one. Are you
>     thinking
>       .> that all set-based operations should behave this way? For
>     example, setdiff
>       .> and intersect? I really liked the syntax of "peaks %in% genes".
>     In my
>       .> experience, it's way more common to ask questions about overlap
>     than about
>       .> equality, so I'd rather optimize the API for that use case. But
>     again,
>       .> that's just my personal bias.
>       .
>       .For what it is worth, I share Michael's personal bias here.
>       .
>       .Sean
>       .
>       .
>       .> Michael
>       .>
>       .>
>       .> On Fri, Jan 4, 2013 at 1:11 PM, Hervé Pagès <hpages at fhcrc.org
>     <mailto:hpages at fhcrc.org>> wrote:
>       .>
>       .>> Hi,
>       .>>
>       .>> I added findMatches() and countMatches() to the latest IRanges /
>       .>> GenomicRanges packages (in BioC devel only).
>       .>>
>       .>>   findMatches(x, table): An enhanced version of ‘match’ that
>       .>>           returns all the matches in a Hits object.
>       .>>
>       .>>   countMatches(x, table): Returns an integer vector of the length
>       .>>           of ‘x’, containing the number of matches in ‘table’ for
>       .>>           each element in ‘x’.
>       .>>
>
>       .>> countMatches() is what you can use to tally/count/tabulate
>     (choose your
>
>       .>> preferred term) the unique elements in a GRanges object:
>       .>>
>       .>>   library(GenomicRanges)
>       .>>   set.seed(33)
>       .>>   gr <- GRanges("chr1", IRanges(sample(15,20,replace=**TRUE),
>     width=5))
>       .>>
>       .>> Then:
>       .>>
>       .>>   > gr_levels <- sort(unique(gr))
>       .>>   > countMatches(gr_levels, gr)
>       .>>    [1] 1 1 1 2 4 2 2 1 2 2 2
>       .>>
>       .>> Note that findMatches() and countMatches() also work on
>     IRanges and
>       .>> DNAStringSet objects, as well as on ordinary atomic vectors:
>       .>>
>       .>>   library(hgu95av2probe)
>       .>>   library(Biostrings)
>       .>>   probes <- DNAStringSet(hgu95av2probe)
>       .>>   unique_probes <- unique(probes)
>       .>>   count <- countMatches(unique_probes, probes)
>       .>>   max(count)  # 7
>       .>>
>       .>> I made other changes in IRanges/GenomicRanges so that the notion
>       .>> of "match" between elements of a vector-like object now
>     consistently
>       .>> means "equality" instead of "overlap", even for range-based
>     objects
>       .>> like IRanges or GRanges objects. This notion of "equality" is the
>       .>> same that is used by ==. The most visible consequence of those
>       .>> changes is that using %in% between 2 IRanges or GRanges objects
>       .>> 'query' and 'subject' in order to do overlaps was replaced by
>       .>> overlapsAny(query, subject).
>       .>>
>       .>>   overlapsAny(query, subject): Finds the ranges in ‘query’ that
>       .>>      overlap any of the ranges in ‘subject’.
>       .>>
>
>       .>> There are warnings and deprecation messages in place to help
>     smooth
>
>       .>> the transition.
>       .>>
>       .>> Cheers,
>       .>> H.
>       .>>
>       .>> --
>       .>> Hervé Pagès
>       .>>
>       .>> Program in Computational Biology
>       .>> Division of Public Health Sciences
>       .>> Fred Hutchinson Cancer Research Center
>       .>> 1100 Fairview Ave. N, M1-B514
>       .>> P.O. Box 19024
>       .>> Seattle, WA 98109-1024
>       .>>
>       .>> E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>       .>> Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
>       .>> Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>       .>>
>       .>
>       .>         [[alternative HTML version deleted]]
>       .>
>       .>
>       .> _______________________________________________
>       .> Bioconductor mailing list
>       .> Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
>       .> https://stat.ethz.ch/mailman/listinfo/bioconductor
>       .> Search the archives:
>     http://news.gmane.org/gmane.science.biology.informatics.conductor
>       .
>       ._______________________________________________
>       .Bioconductor mailing list
>       .Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
>       .https://stat.ethz.ch/mailman/listinfo/bioconductor
>       .Search the archives:
>     http://news.gmane.org/gmane.science.biology.informatics.conductor
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319