[BioC] countMatches() (was: table for GenomicRanges)
Cook, Malcolm
MEC at stowers.org
Tue Jan 8 18:38:56 CET 2013
.Hi Tim,
.
.I could add the %ov% operator as a replacement for the old %in%. So you
.would write 'peaks %ov% genes' instead of 'peaks %in% genes'. Would just
.be a convenience wrapper for 'overlapsAny(peaks, genes)'.
[cloak off]
Herve, I think this is the BEST course, and except for one letter, is what I hoped I meant back when I wrote:
> If so, may I suggest that Herve stay the
> course, with the
> addition of
> '"%ol%" <- function(a, b) findOverlaps(a,
> b, maxgap=0L,
> minoverlap=1L, type='any', select='all') > 0'
Stay the course, captain.
[cloak on]
.
.Cheers,
.H.
.
.On 01/07/2013 11:45 AM, Tim Triche, Jr. wrote:
.> So why not leave %in% as it was and transition everything forward to
.> explicitly using { `%within%`, `%overlaps%`|`%overlapping%`, `%equals%`
.> } such that
.>
.> identical( x %within% table, countOverlaps(x, table, type='within') >
.> 0 ) == TRUE
.> identical( x %overlaps% table, countOverlaps(x, table, type='any') >
.> 0 ) == TRUE
.> identical( x %equals% table, countOverlaps(x, table, type='equal') >
.> 0 ) == TRUE
.>
.> and for the time being,
.>
.> identical( x %overlaps% table, countOverlaps(x, table, type='any') >
.> 0 ) == TRUE ## but with a noisy nastygram that will halt if
.> options("warn"=2)
.> No breakage for %in% methods until such time as a full deprecation cycle
.> has passed, and if the maintainers can't be arsed to do anything at all
.> about the warnings by the second full release, then perhaps they don't
.> really care that much after all. Just a thought?
.>
.> From someone (me) who has their own issues with keeping everything up
.> to date and should know better. If you want to use %in% for
.>
.> peaks %in% genes (why on earth would you do this rather than peaks
.> %in% promoters(genes), anyways?)
.>
.> then a nastygram could be emitted "WARNING: YOUR SHORTHAND NOTATION IS
.> DOOMED AFTER BIOC 2.13, YOU WILL BE ASSIMILATED" and everyone is (more
.> or less) happy.
.>
.>
.>
.> On Mon, Jan 7, 2013 at 11:33 AM, Michael Lawrence
.> <lawrence.michael at gene.com <mailto:lawrence.michael at gene.com>> wrote:
.>
.>
.>
.>
.> On Mon, Jan 7, 2013 at 11:00 AM, Hervé Pagès <hpages at fhcrc.org
.> <mailto:hpages at fhcrc.org>> wrote:
.>
.> Hi Michael,
.>
.> I don't think "match" (the word) always has to mean "equality"
.> either.
.> However having match() (the function) do "whole exact matching" (aka
.> "equality") for any kind of vector-like object has the advantage of:
.>
.> (a) making it consistent with base::match() (?base::match is
.> pretty
.> explicit about what the contract of match() is)
.>
.>
.> (a) alone is obviously not enough. We have many methods, like the
.> set operations, that treat ranges specially. Are we going to start
.> moving everything toward the base behavior? And have rangeIntersect,
.> rangeSetdiff, etc?
.>
.> (b) preserving its relationship with ==, duplicated(), unique(),
.> etc...
.>
.>
.> So it becomes consistent with duplicated/unique, but we lose
.> consistency with the set operations.
.>
.> (c) not frustrating the user who needs something to do exact
.> matching on ranges (as I mentioned previously, if you take
.> match() away from him/her, s/he'll be left with nothing).
.>
.>
.> No one has ever asked for match() to behave this way. There was a
.> request for a way to tabulate identical ranges. It was a nice idea
.> to extract the general "outer equal" findMatches function. But the
.> changes seem to be snow-balling. These types of changes mean a lot
.> of maintenance work for the users. A deprecation cycle does not
.> circumvent that.
.>
.>
.> IMO those advantages counterbalance *by far* the very little
.> convenience you get from having 'match(query, subject)' do
.> 'findOverlaps(query, subject, select="first")' on
.> IRanges/GRanges objects. If you need to do that, just use the
.> latter, or, if you think that's still too much typing, define
.> a wrapper e.g. 'ovmatch(query, subject)'.
.>
.> There are plenty of specialized tools around for doing
.> inexact/fuzzy/partial/overlap matching for many particular types
.> of vector-like objects: grep() and family, pmatch(), charmatch(),
.> agrep(), grepRaw(), matchPattern() and family, findOverlaps() and
.> family, findIntervals(), etc... For the reasons I mentioned
.> above, none of them should hijack match() to make it do some
.> particular type of inexact matching on some particular type of
.> objects. Even if, for that particular type of objects, doing that
.> particular type of inexact matching is more common than doing
.> exact matching.
.>
.> H.
.>
.>
.>
.> On 01/06/2013 05:39 PM, Michael Lawrence wrote:
.>
.> I think having overlapsAny is a nice addition and helps make
.> the API
.> more complete and explicit. Are you sure we need to change
.> the behavior
.> of the match method for this relatively uncommon use case?
.>
.>
.> Yes because otherwise users with a use case of doing match()
.>
.> even if it's uncommon,
.>
.>
.> I don't think
.> "match" always has to mean "equality". It is a more general
.> concept in
.> my mind. The most common use case for matching ranges is
.> overlap.
.>
.>
.> Of course "match" doesn't always have to mean equality. But of base
.>
.>
.> Michael
.>
.>
.> On Fri, Jan 4, 2013 at 8:34 PM, Hervé Pagès
.> <hpages at fhcrc.org <mailto:hpages at fhcrc.org>
.> <mailto:hpages at fhcrc.org <mailto:hpages at fhcrc.org>>> wrote:
.>
.> Yes 'peaks %in% genes' is cute and was probably doing
.> the right thing
.> for most users (although not all). But 'exons %in%
.> genes' is cute too
.> and was probably doing the wrong thing for all users.
.> Advanced users
.> like you guys would have no problem switching to
.>
.> !is.na <http://is.na>
.> <http://is.na>(findOverlaps(__peaks, genes, type="within",
.> select="any"))
.>
.> or
.>
.> !is.na <http://is.na>
.> <http://is.na>(findOverlaps(__peaks, genes, type="equal",
.>
.> select="any"))
.>
.> in case 'peaks %in% genes' was not doing exactly what
.> you wanted,
.> but most users would not find this particularly
.> friendly. Even
.> worse, some users probably didn't realize that 'peaks
.> %in% genes'
.> was not doing exactly what they thought it did because
.> "peaks in
.> genes" in English suggests that the peaks are within
.> the genes,
.> but it's not what 'peaks %in% genes' does.
.>
.> Having overlapsAny(), with exactly the same extra
.> arguments as
.> countOverlaps() and subsetByOverlaps() (i.e. 'maxgap',
.> 'minoverlap',
.> 'type', 'ignore.strand'), all of them documented (and
.> with most
.> users more or less familiar with them already) has the
.> virtue to
.> expose the user to all the options from the very start,
.> and to
.> help him/her make the right choice. Of course there
.> will be users
.> that don't want or don't have the time to read/think
.> about all the
.> options. Not a big deal: they'll just do
.> 'overlapsAny(query, subject)',
.> which is not a lot more typing than 'query %in%
.> subject', especially
.> if they use tab completion.
.>
.> It's true that it's more common to ask questions about
.> overlap than
.> about equality but there are some use cases for the
.> latter (as the
.> original thread shows). Until now, when you had such a
.> use case, you
.> could not use match() or %in%, which would have been
.> the natural things
.> to use, because they got hijacked to do something else,
.> and you were
.> left with nothing. Not a satisfying situation. So at a
.> minimum, we
.> needed to restore the true/real/original semantic of
.> match() to do
.> "equality" instead of "overlap". But it's hard to do
.> this for match()
.> and not do it for %in% too. For more than 99% of R
.> users, %in% is
.> just a simple wrapper for 'match(x, table, nomatch = 0)
.> > 0' (this
.> is how it has been documented and implemented in base R
.> for many
.> years). Not maintaining this relationship between %in%
.> and match()
.> would only cause grief and frustration to newcomers to
.> Bioconductor.
.>
.> H.
.>
.>
.>
.> On 01/04/2013 03:32 PM, Cook, Malcolm wrote:
.>
.> Hiya again,
.>
.> I am definitely a late comer to BioC, so I
.> definitely easily
.> defer to
.> the tide of history.
.>
.> But I do think you miss my point Michael about the
.> proposed change
.> making the relationship between %in% and match for
.> {G,I}Ranges{List}
.> mimic that between other vectors, and I do think
.> that changing
.> the API
.> would make other late-comers take to BioC
.> easier/faster.
.>
.> That said, I NEVER use %in% so I really have no
.> stake in the
.> matter, and
.> I DEFINITELY appreciate the argument to not
.> changing the API
.> just for
.> sematic sweetness.
.>
.> That that said, Herve is _/so good/_ about
.> deprecations and warnings
.>
.> that make such changes fairly easily digestible.
.>
.> That that that.... enough.... I bow out of this
.> one....!!!!
.>
.> Always learning and Happy New Year to all lurkers,
.>
.> ~Malcolm
.>
.> *From:*Michael Lawrence
.> [mailto:lawrence.michael at gene.
.> <mailto:lawrence.michael at gene.>____com
.>
.> <mailto:lawrence.michael at gene.__com
.> <mailto:lawrence.michael at gene.com>>]
.> *Sent:* Friday, January 04, 2013 5:11 PM
.> *To:* Cook, Malcolm
.> *Cc:* Sean Davis; Michael Lawrence; Hervé Pagès
.> (hpages at fhcrc.org <mailto:hpages at fhcrc.org>
.> <mailto:hpages at fhcrc.org <mailto:hpages at fhcrc.org>>); Tim
.>
.>
.> Triche, Jr.; Vedran Franke;
.> bioconductor at r-project.org <mailto:bioconductor at r-project.org>
.> <mailto:bioconductor at r-__project.org
.> <mailto:bioconductor at r-project.org>>
.> *Subject:* Re: [BioC] countMatches() (was: table
.> for GenomicRanges)
.>
.>
.> On Fri, Jan 4, 2013 at 1:56 PM, Cook, Malcolm
.> <MEC at stowers.org <mailto:MEC at stowers.org>
.> <mailto:MEC at stowers.org <mailto:MEC at stowers.org>>
.> <mailto:MEC at stowers.org <mailto:MEC at stowers.org>
.> <mailto:MEC at stowers.org <mailto:MEC at stowers.org>>>> wrote:
.>
.> Hiya,
.>
.> For what it is worth...
.>
.> I think the change to %in% is warranted.
.>
.> If I understand correctly, this change restores the
.> relationship
.> between
.> the semantics of `%in` and the semantics of `match`.
.>
.> From the docs:
.>
.> '"%in%" <- function(x, table) match(x, table,
.> nomatch = 0) > 0'
.>
.> Herve's change restores this relationship.
.>
.>
.> match and %in% were initially consistent (both
.> considering any
.> overlap);
.> Herve has changed both of them together. The whole
.> idea behind
.> IRanges
.> is that ranges are special data types with special
.> semantics. We
.> have
.> reimplemented much of the existing R vector API
.> using those
.> semantics;
.> this extends beyond match/%in%. I am hesitant about
.> making such
.> sweeping
.> changes to the API so late in the life-cycle of the
.> package.
.> There was a
.> feature request for a way to count identical ranges
.> in a set of
.> ranges.
.> Let's please not get carried away and start
.> redesigning the API
.> for this
.> one, albeit useful, request. There are all sorts of
.> inconsistencies in
.> the API, and many of them were conscious decisions
.> that considered
.> practical use cases.
.>
.> Michael
.>
.>
.> Herve, I suspect you were you as a result able to
.> completely drop
.> all the `%in%,BiocClass1,BiocClass2`
.> definitions and depend
.> upon
.> base::%in%
.>
.> Am I right?
.>
.> If so, may I suggest that Herve stay the
.> course, with the
.> addition of
.> '"%ol%" <- function(a, b) findOverlaps(a,
.> b, maxgap=0L,
.> minoverlap=1L, type='any', select='all') > 0'
.>
.> This would provide a perspicacious idiom, thereby
.> optimizing the API
.> for Michaels observed common use case.
.>
.> Just sayin'
.>
.> ~Malcolm
.>
.>
.> .-----Original Message-----
.> .From:
.> bioconductor-bounces at r-____project.org
.> <mailto:bioconductor-bounces at r-__project.org>
.> <mailto:bioconductor-bounces at __r-project.org
.> <mailto:bioconductor-bounces at r-project.org>>
.> <mailto:bioconductor-bounces@
.> <mailto:bioconductor-bounces@>____r-project.org
.> <http://r-project.org>
.> <mailto:bioconductor-bounces at __r-project.org
.> <mailto:bioconductor-bounces at r-project.org>>>
.> [mailto:bioconductor-bounces@
.> <mailto:bioconductor-bounces@>____r-project.org
.> <http://r-project.org>
.> <mailto:bioconductor-bounces at __r-project.org
.> <mailto:bioconductor-bounces at r-project.org>>
.>
.> <mailto:bioconductor-bounces@
.> <mailto:bioconductor-bounces@>____r-project.org
.> <http://r-project.org>
.>
.> <mailto:bioconductor-bounces at __r-project.org
.> <mailto:bioconductor-bounces at r-project.org>>>] On Behalf Of Sean
.> Davis
.> .Sent: Friday, January 04, 2013 3:37 PM
.> .To: Michael Lawrence
.> .Cc: Tim Triche, Jr.; Vedran Franke;
.> bioconductor at r-project.org
.> <mailto:bioconductor at r-project.org>
.> <mailto:bioconductor at r-__project.org
.> <mailto:bioconductor at r-project.org>>
.> <mailto:bioconductor at r-____project.org
.> <mailto:bioconductor at r-__project.org>
.>
.> <mailto:bioconductor at r-__project.org
.> <mailto:bioconductor at r-project.org>>>
.>
.> .Subject: Re: [BioC] countMatches() (was:
.> table for
.> GenomicRanges)
.> .
.> .On Fri, Jan 4, 2013 at 4:32 PM, Michael
.> Lawrence
.> .<lawrence.michael at gene.com
.> <mailto:lawrence.michael at gene.com>
.> <mailto:lawrence.michael at gene.__com
.> <mailto:lawrence.michael at gene.com>>
.> <mailto:lawrence.michael at gene.
.> <mailto:lawrence.michael at gene.>____com
.>
.> <mailto:lawrence.michael at gene.__com
.> <mailto:lawrence.michael at gene.com>>>> wrote:
.> .> The change to the behavior of %in% is a
.> pretty big
.> one. Are you
.> thinking
.> .> that all set-based operations should
.> behave this way? For
.> example, setdiff
.> .> and intersect? I really liked the syntax
.> of "peaks
.> %in% genes".
.> In my
.> .> experience, it's way more common to ask
.> questions
.> about overlap
.> than about
.> .> equality, so I'd rather optimize the API
.> for that use
.> case. But
.> again,
.> .> that's just my personal bias.
.> .
.> .For what it is worth, I share Michael's
.> personal bias here.
.> .
.> .Sean
.> .
.> .
.> .> Michael
.> .>
.> .>
.> .> On Fri, Jan 4, 2013 at 1:11 PM, Hervé Pagès
.> <hpages at fhcrc.org <mailto:hpages at fhcrc.org>
.> <mailto:hpages at fhcrc.org <mailto:hpages at fhcrc.org>>
.> <mailto:hpages at fhcrc.org
.> <mailto:hpages at fhcrc.org> <mailto:hpages at fhcrc.org
.> <mailto:hpages at fhcrc.org>>>> wrote:
.> .>
.> .>> Hi,
.> .>>
.> .>> I added findMatches() and countMatches()
.> to the
.> latest IRanges /
.> .>> GenomicRanges packages (in BioC devel only).
.> .>>
.> .>> findMatches(x, table): An enhanced
.> version of
.> 'match' that
.> .>> returns all the matches in a
.> Hits object.
.> .>>
.> .>> countMatches(x, table): Returns an
.> integer vector
.> of the length
.> .>> of 'x', containing the number
.> of matches in
.> 'table' for
.> .>> each element in 'x'.
.> .>>
.>
.> .>> countMatches() is what you can use to
.> tally/count/tabulate
.> (choose your
.>
.> .>> preferred term) the unique elements in a
.> GRanges object:
.> .>>
.> .>> library(GenomicRanges)
.> .>> set.seed(33)
.> .>> gr <- GRanges("chr1",
.> IRanges(sample(15,20,replace=*____*TRUE),
.>
.> width=5))
.> .>>
.> .>> Then:
.> .>>
.> .>> > gr_levels <- sort(unique(gr))
.> .>> > countMatches(gr_levels, gr)
.> .>> [1] 1 1 1 2 4 2 2 1 2 2 2
.> .>>
.> .>> Note that findMatches() and
.> countMatches() also work on
.> IRanges and
.> .>> DNAStringSet objects, as well as on
.> ordinary atomic
.> vectors:
.> .>>
.> .>> library(hgu95av2probe)
.> .>> library(Biostrings)
.> .>> probes <- DNAStringSet(hgu95av2probe)
.> .>> unique_probes <- unique(probes)
.> .>> count <- countMatches(unique_probes,
.> probes)
.> .>> max(count) # 7
.> .>>
.> .>> I made other changes in
.> IRanges/GenomicRanges so that
.> the notion
.> .>> of "match" between elements of a
.> vector-like object now
.> consistently
.> .>> means "equality" instead of "overlap",
.> even for
.> range-based
.> objects
.> .>> like IRanges or GRanges objects. This
.> notion of
.> "equality" is the
.> .>> same that is used by ==. The most
.> visible consequence
.> of those
.> .>> changes is that using %in% between 2
.> IRanges or
.> GRanges objects
.> .>> 'query' and 'subject' in order to do
.> overlaps was
.> replaced by
.> .>> overlapsAny(query, subject).
.> .>>
.> .>> overlapsAny(query, subject): Finds the
.> ranges in
.> 'query' that
.> .>> overlap any of the ranges in 'subject'.
.> .>>
.>
.> .>> There are warnings and deprecation
.> messages in place
.> to help
.> smooth
.>
.> .>> the transition.
.> .>>
.> .>> Cheers,
.> .>> H.
.> .>>
.> .>> --
.> .>> Hervé Pagès
.> .>>
.> .>> Program in Computational Biology
.> .>> Division of Public Health Sciences
.> .>> Fred Hutchinson Cancer Research Center
.> .>> 1100 Fairview Ave. N, M1-B514
.> .>> P.O. Box 19024
.> .>> Seattle, WA 98109-1024
.> .>>
.> .>> E-mail: hpages at fhcrc.org
.> <mailto:hpages at fhcrc.org> <mailto:hpages at fhcrc.org
.> <mailto:hpages at fhcrc.org>>
.> <mailto:hpages at fhcrc.org <mailto:hpages at fhcrc.org>
.> <mailto:hpages at fhcrc.org <mailto:hpages at fhcrc.org>>>
.>
.> .>> Phone: (206) 667-5791
.> <tel:%28206%29%20667-5791> <tel:%28206%29%20667-5791>
.> <tel:%28206%29%20667-5791>
.> .>> Fax: (206) 667-1319
.> <tel:%28206%29%20667-1319> <tel:%28206%29%20667-1319>
.> <tel:%28206%29%20667-1319>
.>
.> .>>
.> .>
.> .> [[alternative HTML version deleted]]
.> .>
.> .>
.> .>
.> ___________________________________________________
.>
.> .> Bioconductor mailing list
.> .> Bioconductor at r-project.org
.> <mailto:Bioconductor at r-project.org>
.> <mailto:Bioconductor at r-__project.org
.> <mailto:Bioconductor at r-project.org>>
.> <mailto:Bioconductor at r-____project.org
.> <mailto:Bioconductor at r-__project.org>
.> <mailto:Bioconductor at r-__project.org
.> <mailto:Bioconductor at r-project.org>>>
.>
.> .>
.> https://stat.ethz.ch/mailman/____listinfo/bioconductor
.> <https://stat.ethz.ch/mailman/__listinfo/bioconductor>
.>
.>
.> <https://stat.ethz.ch/mailman/__listinfo/bioconductor
.> <https://stat.ethz.ch/mailman/listinfo/bioconductor>>
.> .> Search the archives:
.> http://news.gmane.org/gmane.____science.biology.informatics.____conductor
.> <http://news.gmane.org/gmane.__science.biology.informatics.__conductor>
.>
.> <http://news.gmane.org/gmane.__science.biology.informatics.__conductor
.> <http://news.gmane.org/gmane.science.biology.informatics.conductor>>
.> .
.>
.> .___________________________________________________
.>
.> .Bioconductor mailing list
.> .Bioconductor at r-project.org
.> <mailto:Bioconductor at r-project.org>
.> <mailto:Bioconductor at r-__project.org
.> <mailto:Bioconductor at r-project.org>>
.> <mailto:Bioconductor at r-____project.org
.> <mailto:Bioconductor at r-__project.org>
.> <mailto:Bioconductor at r-__project.org
.> <mailto:Bioconductor at r-project.org>>>
.>
.>
.> .https://stat.ethz.ch/mailman/____listinfo/bioconductor
.> <https://stat.ethz.ch/mailman/__listinfo/bioconductor>
.>
.>
.> <https://stat.ethz.ch/mailman/__listinfo/bioconductor
.> <https://stat.ethz.ch/mailman/listinfo/bioconductor>>
.> .Search the archives:
.> http://news.gmane.org/gmane.____science.biology.informatics.____conductor
.> <http://news.gmane.org/gmane.__science.biology.informatics.__conductor>
.>
.>
.> <http://news.gmane.org/gmane.__science.biology.informatics.__conductor
.> <http://news.gmane.org/gmane.science.biology.informatics.conductor>>
.>
.>
.> --
.> Hervé Pagès
.>
.> Program in Computational Biology
.> Division of Public Health Sciences
.> Fred Hutchinson Cancer Research Center
.> 1100 Fairview Ave. N, M1-B514
.> P.O. Box 19024
.> Seattle, WA 98109-1024
.>
.> E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
.> <mailto:hpages at fhcrc.org <mailto:hpages at fhcrc.org>>
.>
.> Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
.> <tel:%28206%29%20667-5791>
.> Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
.> <tel:%28206%29%20667-1319>
.>
.>
.>
.> --
.> Hervé Pagès
.>
.> Program in Computational Biology
.> Division of Public Health Sciences
.> Fred Hutchinson Cancer Research Center
.> 1100 Fairview Ave. N, M1-B514
.> P.O. Box 19024
.> Seattle, WA 98109-1024
.>
.> E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
.> Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
.> Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
.>
.>
.>
.>
.>
.> --
.> /A model is a lie that helps you see the truth./
.> /
.> /
.> Howard Skipper
.> <http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf>
.
.--
.Hervé Pagès
.
.Program in Computational Biology
.Division of Public Health Sciences
.Fred Hutchinson Cancer Research Center
.1100 Fairview Ave. N, M1-B514
.P.O. Box 19024
.Seattle, WA 98109-1024
.
.E-mail: hpages at fhcrc.org
.Phone: (206) 667-5791
.Fax: (206) 667-1319
More information about the Bioconductor
mailing list