[BioC] Semantics of GenomicRanges gaps()

Fri Aug 23 20:31:46 CEST 2013

Hi Dan,

Thanks for catching this. Herve is currently out of town. We'll put this 
on the TODO and address it when he is back.

Valerie

On 08/21/2013 12:13 AM, Dan Du wrote:
> Hi all,
>
> Sorry to wake up this sleeping beauty.
>
> Just something more to add and consider, there is another inconsistency
> in the gaps function for GRanges, when there is no seqinfo present. The
> default gaps will not produce extra ranges for those unused strands,
> unless one specify the "end" argument, however only setting "start"
> won't trigger the extra results. This is the current behavior of both
> release and dev branch,
>
> ## rls
> GenomicRanges_1.13.36 IRanges_1.19.24
> ## dev
> GenomicRanges_1.12.4 IRanges_1.18.3
>
> ## case without seqinfo
> g<-GRanges(seqnames="1", IRanges(3, 10), strand='*')
> seqinfo=Seqinfo("1",seqlengths=1000))
> g
> GRanges with 1 range and 0 metadata columns:
>        seqnames    ranges strand
>           <Rle> <IRanges>  <Rle>
>    [1]        1   [3, 10]      *
>    ---
>    seqlengths:
>      1
>     NA
>
> ## default, no extra
> gaps(g) # this is expected, start always defaults to 1L
> GRanges with 1 range and 0 metadata columns:
>        seqnames    ranges strand
>           <Rle> <IRanges>  <Rle>
>    [1]        1    [1, 2]      *
>    ---
>    seqlengths:
>      1
>     NA
>
> ## with start, no extra
> gaps(g, start=2)
> GRanges with 1 range and 0 metadata columns:
>        seqnames    ranges strand
>           <Rle> <IRanges>  <Rle>
>    [1]        1    [2, 2]      *
>    ---
>    seqlengths:
>      1
>     NA
> ## with end, strand comes into play, giving two more
> gaps(g, end=20)
> GRanges with 4 ranges and 0 metadata columns:
>        seqnames    ranges strand
>           <Rle> <IRanges>  <Rle>
>    [1]        1  [ 1, 20]      +
>    [2]        1  [ 1, 20]      -
>    [3]        1  [ 1,  2]      *
>    [4]        1  [11, 20]      *
>    ---
>    seqlengths:
>      1
>     NA
>
> ## with both start and end, same as above
> gaps(g, start=2, end=20) #
> GRanges with 4 ranges and 0 metadata columns:
>        seqnames    ranges strand
>           <Rle> <IRanges>  <Rle>
>    [1]        1  [ 2, 20]      +
>    [2]        1  [ 2, 20]      -
>    [3]        1  [ 2,  2]      *
>    [4]        1  [11, 20]      *
>    ---
>    seqlengths:
>      1
> ## reset seqinfo
> seqinfo(g)<-Seqinfo("1",seqlengths=1000)
>
> ## now the extra ranges are back
> gaps(g, start=2)
> GRanges with 4 ranges and 0 metadata columns:
>        seqnames     ranges strand
>           <Rle>  <IRanges>  <Rle>
>    [1]        1 [ 2, 1000]      +
>    [2]        1 [ 2, 1000]      -
>    [3]        1 [ 2,    2]      *
>    [4]        1 [11, 1000]      *
>    ---
>    seqlengths:
>        1
>     1000
>
> Best,
> Dan
>
> On Mon, 2013-06-03 at 05:54 -0700, Michael Lawrence wrote:
>> Not sure how confusing this would be, but a common case is (like Alex's
>> input) a set of ranges where the strand is uniform, i.e., all ranges have
>> strand 'x'. In that case, gaps,GenomicRanges could behave just like
>> gaps,Ranges and return only ranges on strand 'x'. That would satisfy both
>> properties.
>>
>> A natural extension would be that if all ranges are '+' or '-' (none of
>> them are '*'), then do not add the '*' range spanning the whole sequence.
>> Also satisfies both properties.
>>
>> I only know of one use-case where all three strand types are mixed:
>> rna-seq, where the strand has been inferred only from the spliced
>> alignments. Not if gaps() would even be useful in that case, but if it
>> were, I think the current gaps behavior would make sense.
>>
>> Michael
>>
>>
>>
>>
>> On Sun, Jun 2, 2013 at 11:36 PM, Herv Pags <hpages at fhcrc.org> wrote:
>>
>>> Hi Alex,
>>>
>>>
>>> On 05/31/2013 01:46 AM, Alex Gutteridge wrote:
>>>
>>>> I just have a quick question/comment about the behaviour of the gaps
>>>> function in the GenomicRanges package, particularly with how it handles
>>>> * strand ranges. The current behaviour is as below:
>>>>
>>>>   range = GRanges(seqnames="1",IRanges(**start=300,end=700), strand="*",
>>>>> seqinfo=Seqinfo("1",**seqlengths=1000))
>>>>> range
>>>>>
>>>> GRanges with 1 range and 0 metadata columns:
>>>>         seqnames     ranges strand
>>>>            <Rle>  <IRanges>  <Rle>
>>>>     [1]        1 [300, 700]      *
>>>>     ---
>>>>     seqlengths:
>>>>         1
>>>>      1000
>>>>
>>>>> gaps(range)
>>>>>
>>>> GRanges with 4 ranges and 0 metadata columns:
>>>>         seqnames      ranges strand
>>>>            <Rle>   <IRanges>  <Rle>
>>>>     [1]        1 [  1, 1000]      +
>>>>     [2]        1 [  1, 1000]      -
>>>>     [3]        1 [  1,  299]      *
>>>>     [4]        1 [701, 1000]      *
>>>>     ---
>>>>     seqlengths:
>>>>         1
>>>>      1000
>>>>
>>>> My interpretation of "*" as a strand identifier is that it means the
>>>> range covers both + and - strands and so intuitively I was expecting the
>>>> 'gaps' to only cover the 3rd and 4th ranges returned above (not the
>>>> full-length + and - strand ranges).
>>>>
>>>
>>> It's even worse than that. If there is no range at all on a chromosome,
>>> gaps(gr) will return 3 ranges covering the full chromosome:
>>>
>>>    > gr <- GRanges("chr1",
>>>                    IRanges(start=300,end=700),
>>>                    seqlengths=c(chr1=1000,chr2=**1000))
>>>
>>>    > gr
>>>
>>>    GRanges with 1 range and 0 metadata columns:
>>>          seqnames     ranges strand
>>>             <Rle>  <IRanges>  <Rle>
>>>      [1]     chr1 [300, 700]      *
>>>      ---
>>>      seqlengths:
>>>       chr1  chr2
>>>       1000 1000
>>>
>>>    > gaps(gr)
>>>
>>>    GRanges with 7 ranges and 0 metadata columns:
>>>          seqnames      ranges strand
>>>             <Rle>   <IRanges>  <Rle>
>>>      [1]     chr1 [  1, 1000]      +
>>>      [2]     chr1 [  1, 1000]      -
>>>      [3]     chr1 [  1,  299]      *
>>>      [4]     chr1 [701, 1000]      *
>>>      [5]     chr2 [  1, 1000]      +
>>>      [6]     chr2 [  1, 1000]      -
>>>      [7]     chr2 [  1, 1000]      *
>>>      ---
>>>      seqlengths:
>>>       chr1  chr2
>>>       1000 1000
>>>
>>>
>>>   The semantics here implies to me
>>>> that the * strand is being thought of as a kind of imaginary third
>>>> strand and hence doesn't overlap with the + or - strands. This is
>>>> contrary to the semantics of findOverlaps which does appear to consider
>>>> a * strand range to overlap with + or - strand ranges:
>>>>
>>>>   findOverlaps(range,gaps(range)**)
>>>>>
>>>> Hits of length 2
>>>> queryLength: 1
>>>> subjectLength: 4
>>>>     queryHits subjectHits
>>>>      <integer>   <integer>
>>>>    1         1           1
>>>>    2         1           2
>>>>
>>>> To summarise I guess I was expecting findOverlaps(range,gaps(range)**) to
>>>> never return any overlaps under any circumstance (that I can think of!).
>>>>
>>>
>>> Let's call this property 2. This sounds indeed like a good property
>>> that it would be nice to have. But an even more important property is
>>> property 1: gaps() must be its own reverse operation i.e.
>>> 'gaps(gaps(gr))' must always return 'gr'.
>>>
>>> The current behavior of gaps() guarantees property 1. I'm not against
>>> changing gaps() behavior to also have property 2, as long as property
>>> 1 is preserved. Suggestions are welcome.
>>>
>>> Thanks,
>>> H.
>>>
>>>
>>>   Do people agree the behaviour of gaps() is not quite intuitive in this
>>>> case?
>>>>
>>>>   sessionInfo()
>>>>>
>>>> R version 2.15.2 (2012-10-26)
>>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>>
>>>> locale:
>>>>    [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>>>    [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>>>    [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>>>    [7] LC_PAPER=C                 LC_NAME=C
>>>>    [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>>
>>>> attached base packages:
>>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>>
>>>> other attached packages:
>>>> [1] GenomicRanges_1.10.7 IRanges_1.16.6       BiocGenerics_0.4.0
>>>>
>>>> loaded via a namespace (and not attached):
>>>> [1] parallel_2.15.2 stats4_2.15.2
>>>>
>>>>
>>> --
>>> Herv Pags
>>>
>>> Program in Computational Biology
>>> Division of Public Health Sciences
>>> Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Ave. N, M1-B514
>>> P.O. Box 19024
>>> Seattle, WA 98109-1024
>>>
>>> E-mail: hpages at fhcrc.org
>>> Phone:  (206) 667-5791
>>> Fax:    (206) 667-1319
>>>
>>>
>>> ______________________________**_________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https://stat.ethz.ch/mailman/listinfo/bioconductor>
>>> Search the archives: http://news.gmane.org/gmane.**
>>> science.biology.informatics.**conductor<http://news.gmane.org/gmane.science.biology.informatics.conductor>
>>>
>>
>> 	[[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>