[BioC] GenomicRanges: nearest() for GRanges not returning overlaps

Thu Jul 5 18:14:06 CEST 2012

Your statement is true except in the case where the query and subject 
are the same (original discussion below). When query and subject are the 
same (or contain >=1 of the same ranges) nearest() should choose 
self-hits. precede() and follow() will never select self-hits because 
they do not 'precede' or 'follow' them.

These unit tests may help further explain the behavior we are expecting 
for "+", "-" and "*".

     r <- IRanges(c(1,5,10), c(2,7,12))
     g <- GRanges("chr1", r, "+")
     checkEquals(precede(r), precede(g))
     checkEquals(follow(r), follow(g))
     checkEquals(nearest(r), nearest(g))

     g <- GRanges("chr1", r, "-")
     checkEquals(follow(r), precede(g))
     checkEquals(precede(r), follow(g))
     checkEquals(nearest(r), nearest(g))

     g <- GRanges("chr1", r, "*")
     checkEquals(follow(g), precede(g))
     checkEquals(nearest(r), follow(g))
     checkEquals(follow(g), nearest(g))

Valerie

On 07/04/2012 02:12 AM, James Perkins wrote:
> Thanks a lot Valerie!
>
> So precede with "*" is analogous to nearest? Since the nearest will
> also be preceding it either + or - direction?
>
> Jim
>
> On 3 July 2012 21:04, Valerie Obenchain<vobencha at fhcrc.org>  wrote:
>> Hi James,
>>
>> Thanks for the bug report. The nearest,GRanges method is currently not
>> selecting any range that includes itself. It should behave like IRanges and
>> excluded self hits when one argument is supplied but not when the same
>> argument is both query and subject. I'll post back here when it is fixed.
>>
>> Here is an example that will help explain how '*' is treated as far as
>> matching. For precede(), follow() and nearest() the '*' strand is treated as
>> '+' and '-' when looking for potential matches.
>>
>>> q<- GRanges("chr1", IRanges(c(1, 3, 9), c(2, 7, 10)))
>>> q
>> GRanges with 3 ranges and 0 elementMetadata cols:
>>        seqnames    ranges strand
>> <Rle>  <IRanges>  <Rle>
>>    [1]     chr1   [1,  2]      *
>>    [2]     chr1   [3,  7]      *
>>    [3]     chr1   [9, 10]      *
>>    ---
>>    seqlengths:
>>     chr1
>>       NA
>>
>> ## The second range could be seen as preceding range 3 on '+'
>> ## or preceding range 1 on '-'. Range 1 is closer so it is chosen.
>> ## The third range precedes no range on '+' but precedes
>> ## range 2 on '-' so range 2 is chosen.
>>> precede(q)
>> [1] 2 1 2
>>> precede(ranges(q))
>> [1]  2  3 NA
>>
>> ## When the strand is '+' GRanges and IRanges behave the same.
>>> strand(q)<- "+"
>>> precede(q)
>> [1]  2  3 NA
>>> precede(ranges(q))
>> [1]  2  3 NA
>>
>> ## When the strand is '-' GRanges treats the ranges as '-' but
>> ## IRanges is unstranded so they are treated as '+'.
>>> strand(q)<- "-"
>>> precede(q)
>> [1] NA  1  2
>>> precede(ranges(q))
>> [1]  2  3 NA
>>
>> Valerie
>>
>>
>> On 07/03/12 10:58, James Perkins wrote:
>>> Thanks,
>>>
>>> But it appears in the example (using GenomicRanges_1.8.7), the second
>>> range is being excluded? I.e. 3-7 matches 5-6 (and 1-3) and thus is
>>>
>>>> subject2<- GRanges(seqnames=rep('chr1',3),ranges=IRanges(c(3, 5, 12),
>>> +>>   >   c(3, 6, 12)))
>>>> subject2<- GRanges(seqnames=rep('chr1',3),ranges=IRanges(c(3, 5, 12),
>>>> c(3, 6, 12)))
>>>> query2
>>> GRanges with 3 ranges and 0 elementMetadata cols:
>>>         seqnames    ranges strand
>>>            <Rle>   <IRanges>    <Rle>
>>>     [1]     chr1   [1,  2]      *
>>>     [2]     chr1   [3,  7]      *
>>>     [3]     chr1   [9, 10]      *
>>>     ---
>>>     seqlengths:
>>>      chr1
>>>        NA
>>>> subject2
>>> GRanges with 3 ranges and 0 elementMetadata cols:
>>>         seqnames    ranges strand
>>>            <Rle>   <IRanges>    <Rle>
>>>     [1]     chr1  [ 3,  3]      *
>>>     [2]     chr1  [ 5,  6]      *
>>>     [3]     chr1  [12, 12]      *
>>>     ---
>>>     seqlengths:
>>>      chr1
>>>        NA
>>>> nearest(query2, subject2)
>>> [1] 1 3 3
>>>> nearest(subject2, query2)
>>> [1] 1 1 3
>>>
>>> Similarly
>>>
>>> nearest(query2, query2)
>>> [1] 2 1 2
>>>
>>> should surely be "1 2 3"? Or have I missed something?
>>>
>>> IRanges is as I would expect:>   nearest(ranges(query2), ranges(subject2))
>>> [1] 1 1 3
>>>
>>>
>>> Jim
>>>
>>>
>>> On 3 July 2012 19:40, Michael Lawrence<lawrence.michael at gene.com>   wrote:
>>>> The way IRanges works is that the self hits are excluded (by default)
>>>> when
>>>> only one argument is passed to nearest(). They should not be excluded
>>>> when
>>>> the same object is passed as both arguments.
>>>>
>>>> Michael
>>>>
>>>> On Tue, Jul 3, 2012 at 7:09 AM, James Perkins<jperkins at biochem.ucl.ac.uk>
>>>> wrote:
>>>>> Hi,
>>>>>
>>>>> I read this:
>>>>>
>>>>> https://mailman.stat.ethz.ch/pipermail/bioconductor/2012-June/046287.html
>>>>>
>>>>> However, after reading it I'm a little confused as to what the default
>>>>> behaviour should be when using nearest with '*', and also how to get
>>>>> back to what I believe was the previous, "IRanges" style behaviour.
>>>>>
>>>>> Is nearest on GRanges supposed to act like nearest on IRanges, or is
>>>>> it instead supposed to return the nearest neighbour OTHER than the
>>>>> query one?
>>>>>
>>>>> I.e.
>>>>>
>>>>>
>>>>>> query<- IRanges(c(1, 3, 9), c(2, 7, 10))
>>>>>> subject<- IRanges(c(3, 5, 12), c(3, 6, 12))
>>>>>> nearest(query, subject)
>>>>> [1] 1 1 3
>>>>>
>>>>> Because, it seems to be behaving differently, i.e. returning the
>>>>> neighbour only, i.e.
>>>>>
>>>>>
>>>>>> query2<- GRanges(seqnames=rep('chr1',3),ranges=IRanges(c(1, 3, 9),
>>>>>> c(2, 7, 10)))
>>>>>> subject2<- GRanges(seqnames=rep('chr1',3),ranges=IRanges(c(3, 5, 12),
>>>>>> c(3, 6, 12)))
>>>>>> nearest(query2, subject2)
>>>>> [1] 1 3 2
>>>>>> sessionInfo()
>>>>> R version 2.15.1 (2012-06-22)
>>>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>>>
>>>>> locale:
>>>>>    [1] LC_CTYPE=en_US             LC_NUMERIC=C
>>>>>    [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
>>>>>    [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8
>>>>>    [7] LC_PAPER=C                 LC_NAME=C
>>>>>    [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>>>> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
>>>>>
>>>>> attached base packages:
>>>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>>>
>>>>> other attached packages:
>>>>> [1] GenomicRanges_1.8.7 IRanges_1.14.4      BiocGenerics_0.2.0
>>>>>
>>>>> loaded via a namespace (and not attached):
>>>>> [1] stats4_2.15.1
>>>>>
>>>>>
>>>>> However, what makes me confused is that the previous behaviour was like
>>>>> IRanges:
>>>>>
>>>>>> query<- IRanges(c(1, 3, 9), c(2, 7, 10))
>>>>>> subject<- IRanges(c(3, 5, 12), c(3, 6, 12))
>>>>>> nearest(query, subject)
>>>>> [1] 1 1 3
>>>>>> query2<- GRanges(seqnames=rep('chr1',3),ranges=IRanges(c(1, 1, 12),
>>>>>> c(2, 7, 12)))
>>>>>> subject2<- GRanges(seqnames=rep('chr1',3),ranges=IRanges(c(3, 5, 12),
>>>>>> c(3, 6, 12)))
>>>>>> nearest(query2, subject2)
>>>>> [1] 1 1 3
>>>>>> sessionInfo()
>>>>> R version 2.15.0 (2012-03-30)
>>>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>>>
>>>>> locale:
>>>>>    [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>>>>    [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>>>>    [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>>>>    [7] LC_PAPER=C                 LC_NAME=C
>>>>>    [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>>>
>>>>> attached base packages:
>>>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>>>
>>>>> other attached packages:
>>>>> [1] GenomicRanges_1.8.6 IRanges_1.14.3      BiocGenerics_0.2.0
>>>>>
>>>>> loaded via a namespace (and not attached):
>>>>> [1] stats4_2.15.0
>>>>>
>>>>>
>>>>> And in fact, this has always been the behaviour right? i.e.: for R
>>>>> 2.13.1
>>>>>
>>>>>> query<- IRanges(c(1, 3, 9), c(2, 7, 10))
>>>>>> subject<- IRanges(c(3, 5, 12), c(3, 6, 12))
>>>>>> nearest(query, subject)
>>>>> [1] 1 1 3
>>>>>> query2<- GRanges(seqnames=rep('chr1',3),ranges=IRanges(c(1, 1, 12),
>>>>>> c(2, 7, 12)))
>>>>>> subject2<- GRanges(seqnames=rep('chr1',3),ranges=IRanges(c(3, 5, 12),
>>>>>> c(3, 6, 12)))
>>>>>> nearest(query2, subject2)
>>>>> [1] 1 1 3
>>>>>> sessionInfo()
>>>>> R version 2.13.1 (2011-07-08)
>>>>> Platform: x86_64-pc-linux-gnu (64-bit)
>>>>>
>>>>> locale:
>>>>>    [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>>>>    [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>>>>    [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
>>>>>    [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>>>>    [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>>>
>>>>> attached base packages:
>>>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>>>
>>>>> other attached packages:
>>>>> [1] GenomicRanges_1.4.8 IRanges_1.10.6
>>>>>
>>>>>
>>>>> Is ignore.strand=TRUE intended to get the IRanges-like behaviour?
>>>>> Because I have problems with this too:
>>>>>
>>>>>    nearest(x = query2, subject = subject2, ignore.strand=TRUE)
>>>>> Error in strand(x)<- strand(subject)<- "+" : object 'x' not found
>>>>>
>>>>> Thanks!
>>>>>
>>>>> Jim
>>>>>
>>>>> _______________________________________________
>>>>> Bioconductor mailing list
>>>>> Bioconductor at r-project.org
>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>> Search the archives:
>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>