[BioC] table for GenomicRanges

Hervé Pagès hpages at fhcrc.org
Thu Dec 6 21:20:23 CET 2012


Thanks Tim for the feedback.

I spent a little more time thinking about this and then realized that
it was related to some other feature I've been thinking of a few months
ago, then I forgot about. This other feature extends the capability
of match() by finding *all* the matches (not just the first like
match()). Surprisingly this is not in base R (but maybe it is, I'm
just not aware of it). So I'd like to add the 2 following functions
to IRanges:

   findMatches(query, subject)
   countMatches(query, subject)

The parallel with findOverlaps/countOverlaps is obvious except that by
"match" here we mean "exact match", so the concept is valid for any
vector-like query and subject with elements that can be compared (i.e.
query[i] == subject[j] is defined).

Like findOverlaps(), findMatches() will return a Hits object.
No 'select' arg for now but we could always add it later and
findOverlaps(. , select="first") would be equivalent to match(.)

I already have initial implementations for those 2 functions (with
room for some performance improvements).

I mention this here because once we have countMatches(), implementing
the count/tally feature becomes very simple. It's just:

   x_levels <- sort(unique(x))
   mcols(x_levels)$count <- countMatches(x_levels, x)

Now it's so simple that I wonder if it's still worth providing a
function for it. By using countMatches() the user can choose where
to put the vector of counts, and how to name the metadata col in
case s/he wants to stick it in 'x_levels'.

I'll start by adding findMatches()/countMatches(). I think they are
valuable per-se.

Cheers,
H.


On 12/05/2012 07:36 PM, Tim Triche, Jr. wrote:
> Now that I am rereading what I wrote, I am wondering whether a column
> labeled $count for a given combination of (seqname, start, end, strand)
> is more immediately obvious to a user, i.e. "there are mcols(GR)$count
> ranges that match this exact description in this here GR".
>
> Perhaps count() really is the way to go, especially from the standpoint
> of making it intuitive.  There are 74465 segments exactly matching the
> state "WeakEnhancer" in the HMM that I summarized (don't bother looking
> for it in ENCODE or Roadmap data, it is a bespoke model with additional
> marks).  Similarly, a "table" of unique reads placements that share a
> common range would summarize them by how many times we saw each.  And
> then the user would see counts accordingly.
>
> ->  I respectfully withdraw my proposal regarding $tally and now feel
> that $count is closer to what people would expect.  Does that make sense?
>
> Example:
>
> mimatGR <- subsetByOverlaps(readsGR, tx.mimat)
> count(mimatGR)  # table of appearances of unique reads, i.e., a library
> complexity surrogate
>
> or more interestingly, let's say there's an ncRNA and a coding RNA
> emerging from a bidirectional promoter.  Then you could have
>
> maybeNcRNA <- reduce(c(tx.ncRNA, tx.mRNA)) # forward and reverse, maybe
> lots of exons, perhaps even a little trans splicing
> bidirectionalGR <- subsetByOverlaps(readsGR, maybeNcRNA) # what's really
> going on here?  count() acts as a spot check
> count(bidirectionalGR)
>
> the latter, especially with a strand-specific RNAseq protocol, could
> tell you about some really interesting biology if conducted along a "theme".
>
> IMHO of course, since I'm not the one implementing it, but more thinking
> about the use cases I encounter.
>
> YMMV and I hope other people will point out the many ways in which I am
> heavily constipated.  :-)
>
> Thanks Herve for your ninja programming skills, you code faster than I
> can think (at least clearly).
>
> --t
>
>
>
> On Wed, Dec 5, 2012 at 7:11 PM, Tim Triche, Jr. <tim.triche at gmail.com
> <mailto:tim.triche at gmail.com>> wrote:
>
>     maybe (.)$count for tallying overlaps, and (.$)tally for counting
>     equalities, by default?
>
>     ha ha only serious.  if the default is to countOverlaps then $count
>     is a handy shortcut.  if you want to tally up matching ranges $tally
>     is more evocative
>
>
>
>     On Wed, Dec 5, 2012 at 7:06 PM, Hervé Pagès <hpages at fhcrc.org
>     <mailto:hpages at fhcrc.org>> wrote:
>
>         Hi Tim,
>
>
>         On 12/05/2012 04:55 PM, Tim Triche, Jr. wrote:
>
>             that's cool, then it's consistent with score() and
>             friends... this
>             sounds like a grand scheme
>
>
>         It's not clear to me if you are voting for tally() + mcols(.)$tally
>         or for count(.) + mcols()$count?
>
>         Note that this functionality doesn't have to be restricted to
>         GenomicRanges and can be extended to any Vector object 'x' for
>         which unique(), sort() and match() work and "do the right thing".
>         Then the implementation is simply:
>
>
>              ans <- sort(unique(x))
>              names(ans) <- mcols(ans) <- NULL
>              y <- match(x, ans)
>              mcols(ans)$count <- tabulate(y, nbins=length(ans))
>              ans
>
>         Unfortunately right now match() on GenomicRanges and Ranges objects
>         reports a match in case of *overlap* instead of *equality* (this is
>         why I needed to use IRanges:::matchIntegerQuads in the
>         implementation
>         of tableGenomicRanges2() I showed previously). Just a heads-up that
>         I'd like to change this but with a transition plan e.g. with an
>         extra
>         argument to the match() generic for letting the user control what
>         "match" means (default would be "overlaps" for now but should
>         probably
>         be changed to "equality" in the future).
>
>         Cheers,
>         H.
>
>
>
>
>             On Wed, Dec 5, 2012 at 4:31 PM, Hervé Pagès
>             <hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>             <mailto:hpages at fhcrc.org <mailto:hpages at fhcrc.org>>> wrote:
>
>                  So with tally, what would be the name of the metadata
>             col? Would "tally"
>                  be OK?
>
>                  It's kind of neat to use the same name for the metadata
>             col as for the
>                  function itself. That makes the code more readable:
>
>                     mcols(count(gr))$count
>
>                  Thanks for the feedback,
>                  H.
>
>
>
>                  On 12/05/2012 03:05 PM, Tim Triche, Jr. wrote:
>
>                      that goes along nicely with BamTally in gmapR,
>             which is a damn
>                      useful
>                      function IMHO... actually I think I wrote a tally()
>             function or
>                      something like it for Rsubread, can't remember
>             whether I sent it
>                      in with
>                      the patch though.  Anyways, a running tally is a
>             good mental
>                      image for
>                      this functionality
>
>
>
>
>                      On Wed, Dec 5, 2012 at 2:59 PM, Steve Lianoglou
>                      <mailinglist.honeypot at gmail.____com
>                      <mailto:mailinglist.honeypot at __gmail.com
>             <mailto:mailinglist.honeypot at gmail.com>>
>                      <mailto:mailinglist.honeypot@
>             <mailto:mailinglist.honeypot@>____gmail.com <http://gmail.com>
>
>                      <mailto:mailinglist.honeypot at __gmail.com
>             <mailto:mailinglist.honeypot at gmail.com>>>>
>
>                      wrote:
>
>                           Hi,
>
>                           On Wed, Dec 5, 2012 at 5:50 PM, Michael Lawrence
>                           <lawrence.michael at gene.com
>             <mailto:lawrence.michael at gene.com>
>                      <mailto:lawrence.michael at gene.__com
>             <mailto:lawrence.michael at gene.com>>
>                      <mailto:lawrence.michael at gene.
>             <mailto:lawrence.michael at gene.>____com
>
>                      <mailto:lawrence.michael at gene.__com
>             <mailto:lawrence.michael at gene.com>>>> wrote:
>                            > The question is whether there is ever a use
>             case to have
>                      a simple
>                           table.
>                            > This is analogous to base R's table and
>             data.frame. For
>                      example,
>                           if you
>                            > call xtabs(), you get a table, then you
>             have to call
>                           as.data.frame to get
>                            > back into the data.frame context. This is
>             sort of clean,
>                      and we could
>                            > create an extension of table that for
>             efficiency stores the
>                           associated
>                            > GRanges along with the counts in the .Data
>             slot. Then as(x,
>                           "GRanges") on
>                            > that would generate the GRanges with the
>             count column.
>                      That would be
>                            > complicated though.
>                            >
>                            > Another issue is that table() cannot be
>             used in the
>                      general way,
>                           due to
>                            > restrictions on dispatch with "...".
>                            >
>                            > So I think I'm in favor of a new "count"
>             generic. That
>                      naming is
>                           consistent
>                            > with countOverlaps, countSubjects,
>             countQueries, etc.
>
>                           Or maybe `tally`? Somehow I have a mental
>             association w/
>                      that being
>                           closer to `tabulate`, but I guess it's really
>             not and maybe
>                      it's just
>                           me that has a mind map that puts `tally`
>             closer to `table` than
>                           `count` is .
>
>                           --
>                           Steve Lianoglou
>                           Graduate Student: Computational Systems Biology
>                             | Memorial Sloan-Kettering Cancer Center
>                             | Weill Medical College of Cornell University
>                           Contact Info:
>             http://cbio.mskcc.org/~lianos/____contact
>             <http://cbio.mskcc.org/~lianos/__contact>
>
>                      <http://cbio.mskcc.org/~__lianos/contact
>             <http://cbio.mskcc.org/~lianos/contact>>
>
>
>
>
>                      --
>                      /A model is a lie that helps you see the truth./
>                      /
>                      /
>                      Howard Skipper
>
>             <http://cancerres.__aacrjourna__ls.org/content/31/9/__1173.__full.pdf
>             <http://aacrjournals.org/content/31/9/__1173.full.pdf>
>             <http://cancerres.__aacrjournals.org/content/31/9/__1173.full.pdf
>             <http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf>>>
>
>
>
>                  --
>                  Hervé Pagès
>
>                  Program in Computational Biology
>                  Division of Public Health Sciences
>                  Fred Hutchinson Cancer Research Center
>                  1100 Fairview Ave. N, M1-B514
>                  P.O. Box 19024
>                  Seattle, WA 98109-1024
>
>                  E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>             <mailto:hpages at fhcrc.org <mailto:hpages at fhcrc.org>>
>                  Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
>             <tel:%28206%29%20667-5791>
>                  Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>             <tel:%28206%29%20667-1319>
>
>
>
>
>
>             --
>             /A model is a lie that helps you see the truth./
>             /
>             /
>             Howard Skipper
>             <http://cancerres.__aacrjournals.org/content/31/9/__1173.full.pdf
>             <http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf>>
>
>
>         --
>         Hervé Pagès
>
>         Program in Computational Biology
>         Division of Public Health Sciences
>         Fred Hutchinson Cancer Research Center
>         1100 Fairview Ave. N, M1-B514
>         P.O. Box 19024
>         Seattle, WA 98109-1024
>
>         E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>         Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
>         Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>
>
>
>
>     --
>     /A model is a lie that helps you see the truth./
>     /
>     /
>     Howard Skipper
>     <http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf>
>
>
>
>
> --
> /A model is a lie that helps you see the truth./
> /
> /
> Howard Skipper
> <http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf>
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioconductor mailing list