[BioC] GRanges setdiff With Features on Both Strands

Hervé Pagès hpages at fhcrc.org
Mon Jan 23 08:32:11 CET 2012


Hi Dario,

Unfortunately the interpretation of the * level in the strand is a
little bit ambiguous: it can mean (a) "range/feature is on both
strands" but it can also mean (b) "strand is unknown" or "strand
is irrelevant". We used to allow NAs in the strand factor in the
early days of the GenomicRanges package to represent (b) so we
were able to distinguish between (a) and (b). But this didn't prove
to be useful enough to justify the "cost" so we decided to not
allow NAs in the strand anymore.

Another somewhat arbitrary design decision was that all the "set
operations" would treat * as a separate (virtual) strand:

 > library(GenomicRanges)
 > x <- GRanges("chr1", IRanges(2, 7), strand="+")
 > y <- GRanges("chr1", IRanges(5, 10), strand="*")

 > union(x, y)
GRanges with 2 ranges and 0 elementMetadata values:
       seqnames    ranges strand
          <Rle> <IRanges>  <Rle>
   [1]     chr1   [2,  7]      +
   [2]     chr1   [5, 10]      *
   ---
   seqlengths:
    chr1
      NA

 > intersect(x, y)
GRanges with 0 ranges and 0 elementMetadata values:
    seqnames    ranges strand
       <Rle> <IRanges>  <Rle>
   ---
   seqlengths:
    chr1
      NA

 > setdiff(x, y)
GRanges with 1 range and 0 elementMetadata values:
       seqnames    ranges strand
          <Rle> <IRanges>  <Rle>
   [1]     chr1    [2, 7]      +
   ---
   seqlengths:
    chr1
      NA

So, by default (i.e. with ignore.strand=FALSE), setdiff() is just being
consistent with the other set operations.

Not necessarily the most natural/intuitive behavior but it has the big
advantage to be something that all GenomicRanges methods can agree on,
and to be easy to describe and simple to implement.

To get what you want, you'll need to do setdiff() twice:

 > strand(y) <- "+"
 > z <- setdiff(x, y)
 > strand(y) <- "-"
 > z <- setdiff(z, y)
 > z
GRanges with 1 range and 0 elementMetadata values:
       seqnames    ranges strand
          <Rle> <IRanges>  <Rle>
   [1]     chr1    [2, 4]      +
   ---
   seqlengths:
    chr1
      NA

As for the weird behavior you get with ignore.strand=TRUE, I agree
it doesn't make much sense and it's not totally clear to me what the
correct behavior should be. Sadly the man page doesn't say anything
about what the intended behavior is (there is only a brief note about
what ignore.strand=TRUE does for parallel set operations), gives no
example, and this case is not covered by the unit tests either.

However, by looking at what union() and gaps() do with
ignore.strand=TRUE, and also at the internals of the GenomicRanges
package, it seems to me that the intention was that:

   union(x, y, ignore.strand=TRUE)
   intersect(x, y, ignore.strand=TRUE)
   setdiff(x, y, ignore.strand=TRUE)

would be equivalent to:

   union(`strand<-`(x, "+"), `strand<-`(y, "+"))
   intersect(`strand<-`(x, "+"), `strand<-`(y, "+"))
   setdiff(`strand<-`(x, "+"), `strand<-`(y, "+"))

So I could fix setdiff() and intersect() to do this (intersect() has
a weird behavior too), and update the documentation. Does that sound
reasonable?

Finally I'd like to mention that I see very little value in supporting
the 'ignore.strand' arg in those 3 methods so another option would be
to just get rid of it.

Cheers,
H.


On 01/15/2012 10:00 PM, Dario Strbenac wrote:
> Hi,
>
> setdiff doesn't work the way I thought it would when one GRanges has strands and the other does not. I thought that features stranded as * would be taken as being on both strands. The output is even stranger if I use ignore.strand = TRUE.
>
>> x
> GRanges with 1 range and 0 elementMetadata values:
>        seqnames    ranges strand
>           <Rle>  <IRanges>   <Rle>
>    [1]     chr1    [2, 7]      +
>    ---
>    seqlengths:
>     chr1
>       NA
>> y
> GRanges with 1 range and 0 elementMetadata values:
>        seqnames    ranges strand
>           <Rle>  <IRanges>   <Rle>
>    [1]     chr1   [5, 10]      *
>    ---
>    seqlengths:
>     chr1
>       NA
>> setdiff(x, y)
> GRanges with 1 range and 0 elementMetadata values:
>        seqnames    ranges strand
>           <Rle>  <IRanges>   <Rle>
>    [1]     chr1    [2, 7]      +
>    ---
>    seqlengths:
>     chr1
>       NA
>> setdiff(x, y, ignore.strand = TRUE)
> GRanges with 2 ranges and 0 elementMetadata values:
>        seqnames    ranges strand
>           <Rle>  <IRanges>   <Rle>
>    [1]     chr1   [1, 10]      -
>    [2]     chr1   [1, 10]      *
>    ---
>    seqlengths:
>     chr1
>       NA
>
> How do I get a result of chr1 [2, 4] + ?
>
> R version 2.14.0 (2011-10-31), GenomicRanges_1.6.4, IRanges_1.12.5
>
> --------------------------------------
> Dario Strbenac
> Research Assistant
> Cancer Epigenetics
> Garvan Institute of Medical Research
> Darlinghurst NSW 2010
> Australia
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor


-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioconductor mailing list