[BioC] GenomicRanges: nearest() for GRanges not returning overlaps

James Perkins jperkins at biochem.ucl.ac.uk
Wed Jul 4 11:12:36 CEST 2012


Thanks a lot Valerie!

So precede with "*" is analogous to nearest? Since the nearest will
also be preceding it either + or - direction?

Jim

On 3 July 2012 21:04, Valerie Obenchain <vobencha at fhcrc.org> wrote:
> Hi James,
>
> Thanks for the bug report. The nearest,GRanges method is currently not
> selecting any range that includes itself. It should behave like IRanges and
> excluded self hits when one argument is supplied but not when the same
> argument is both query and subject. I'll post back here when it is fixed.
>
> Here is an example that will help explain how '*' is treated as far as
> matching. For precede(), follow() and nearest() the '*' strand is treated as
> '+' and '-' when looking for potential matches.
>
>> q <- GRanges("chr1", IRanges(c(1, 3, 9), c(2, 7, 10)))
>
>> q
> GRanges with 3 ranges and 0 elementMetadata cols:
>       seqnames    ranges strand
> <Rle> <IRanges> <Rle>
>   [1]     chr1   [1,  2]      *
>   [2]     chr1   [3,  7]      *
>   [3]     chr1   [9, 10]      *
>   ---
>   seqlengths:
>    chr1
>      NA
>
> ## The second range could be seen as preceding range 3 on '+'
> ## or preceding range 1 on '-'. Range 1 is closer so it is chosen.
> ## The third range precedes no range on '+' but precedes
> ## range 2 on '-' so range 2 is chosen.
>> precede(q)
> [1] 2 1 2
>> precede(ranges(q))
> [1]  2  3 NA
>
> ## When the strand is '+' GRanges and IRanges behave the same.
>> strand(q) <- "+"
>> precede(q)
> [1]  2  3 NA
>> precede(ranges(q))
> [1]  2  3 NA
>
> ## When the strand is '-' GRanges treats the ranges as '-' but
> ## IRanges is unstranded so they are treated as '+'.
>> strand(q) <- "-"
>> precede(q)
> [1] NA  1  2
>> precede(ranges(q))
> [1]  2  3 NA
>
> Valerie
>
>
> On 07/03/12 10:58, James Perkins wrote:
>>
>> Thanks,
>>
>> But it appears in the example (using GenomicRanges_1.8.7), the second
>> range is being excluded? I.e. 3-7 matches 5-6 (and 1-3) and thus is
>>
>>> subject2<- GRanges(seqnames=rep('chr1',3),ranges=IRanges(c(3, 5, 12),
>>
>> +>>  >  c(3, 6, 12)))
>>>
>>> subject2<- GRanges(seqnames=rep('chr1',3),ranges=IRanges(c(3, 5, 12),
>>> c(3, 6, 12)))
>>> query2
>>
>> GRanges with 3 ranges and 0 elementMetadata cols:
>>        seqnames    ranges strand
>>           <Rle>  <IRanges>   <Rle>
>>    [1]     chr1   [1,  2]      *
>>    [2]     chr1   [3,  7]      *
>>    [3]     chr1   [9, 10]      *
>>    ---
>>    seqlengths:
>>     chr1
>>       NA
>>>
>>> subject2
>>
>> GRanges with 3 ranges and 0 elementMetadata cols:
>>        seqnames    ranges strand
>>           <Rle>  <IRanges>   <Rle>
>>    [1]     chr1  [ 3,  3]      *
>>    [2]     chr1  [ 5,  6]      *
>>    [3]     chr1  [12, 12]      *
>>    ---
>>    seqlengths:
>>     chr1
>>       NA
>>>
>>> nearest(query2, subject2)
>>
>> [1] 1 3 3
>>>
>>> nearest(subject2, query2)
>>
>> [1] 1 1 3
>>
>> Similarly
>>
>> nearest(query2, query2)
>> [1] 2 1 2
>>
>> should surely be "1 2 3"? Or have I missed something?
>>
>> IRanges is as I would expect:>  nearest(ranges(query2), ranges(subject2))
>> [1] 1 1 3
>>
>>
>> Jim
>>
>>
>> On 3 July 2012 19:40, Michael Lawrence<lawrence.michael at gene.com>  wrote:
>>>
>>> The way IRanges works is that the self hits are excluded (by default)
>>> when
>>> only one argument is passed to nearest(). They should not be excluded
>>> when
>>> the same object is passed as both arguments.
>>>
>>> Michael
>>>
>>> On Tue, Jul 3, 2012 at 7:09 AM, James Perkins<jperkins at biochem.ucl.ac.uk>
>>> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I read this:
>>>>
>>>> https://mailman.stat.ethz.ch/pipermail/bioconductor/2012-June/046287.html
>>>>
>>>> However, after reading it I'm a little confused as to what the default
>>>> behaviour should be when using nearest with '*', and also how to get
>>>> back to what I believe was the previous, "IRanges" style behaviour.
>>>>
>>>> Is nearest on GRanges supposed to act like nearest on IRanges, or is
>>>> it instead supposed to return the nearest neighbour OTHER than the
>>>> query one?
>>>>
>>>> I.e.
>>>>
>>>>
>>>>> query<- IRanges(c(1, 3, 9), c(2, 7, 10))
>>>>> subject<- IRanges(c(3, 5, 12), c(3, 6, 12))
>>>>> nearest(query, subject)
>>>>
>>>> [1] 1 1 3
>>>>
>>>> Because, it seems to be behaving differently, i.e. returning the
>>>> neighbour only, i.e.
>>>>
>>>>
>>>>> query2<- GRanges(seqnames=rep('chr1',3),ranges=IRanges(c(1, 3, 9),
>>>>> c(2, 7, 10)))
>>>>> subject2<- GRanges(seqnames=rep('chr1',3),ranges=IRanges(c(3, 5, 12),
>>>>> c(3, 6, 12)))
>>>>> nearest(query2, subject2)
>>>>
>>>> [1] 1 3 2
>>>>>
>>>>> sessionInfo()
>>>>
>>>> R version 2.15.1 (2012-06-22)
>>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>>
>>>> locale:
>>>>   [1] LC_CTYPE=en_US             LC_NUMERIC=C
>>>>   [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
>>>>   [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8
>>>>   [7] LC_PAPER=C                 LC_NAME=C
>>>>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>>> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
>>>>
>>>> attached base packages:
>>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>>
>>>> other attached packages:
>>>> [1] GenomicRanges_1.8.7 IRanges_1.14.4      BiocGenerics_0.2.0
>>>>
>>>> loaded via a namespace (and not attached):
>>>> [1] stats4_2.15.1
>>>>
>>>>
>>>> However, what makes me confused is that the previous behaviour was like
>>>> IRanges:
>>>>
>>>>> query<- IRanges(c(1, 3, 9), c(2, 7, 10))
>>>>> subject<- IRanges(c(3, 5, 12), c(3, 6, 12))
>>>>> nearest(query, subject)
>>>>
>>>> [1] 1 1 3
>>>>>
>>>>> query2<- GRanges(seqnames=rep('chr1',3),ranges=IRanges(c(1, 1, 12),
>>>>> c(2, 7, 12)))
>>>>> subject2<- GRanges(seqnames=rep('chr1',3),ranges=IRanges(c(3, 5, 12),
>>>>> c(3, 6, 12)))
>>>>> nearest(query2, subject2)
>>>>
>>>> [1] 1 1 3
>>>>>
>>>>> sessionInfo()
>>>>
>>>> R version 2.15.0 (2012-03-30)
>>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>>
>>>> locale:
>>>>   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>>>   [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>>>   [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>>>   [7] LC_PAPER=C                 LC_NAME=C
>>>>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>>
>>>> attached base packages:
>>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>>
>>>> other attached packages:
>>>> [1] GenomicRanges_1.8.6 IRanges_1.14.3      BiocGenerics_0.2.0
>>>>
>>>> loaded via a namespace (and not attached):
>>>> [1] stats4_2.15.0
>>>>
>>>>
>>>> And in fact, this has always been the behaviour right? i.e.: for R
>>>> 2.13.1
>>>>
>>>>> query<- IRanges(c(1, 3, 9), c(2, 7, 10))
>>>>> subject<- IRanges(c(3, 5, 12), c(3, 6, 12))
>>>>> nearest(query, subject)
>>>>
>>>> [1] 1 1 3
>>>>>
>>>>> query2<- GRanges(seqnames=rep('chr1',3),ranges=IRanges(c(1, 1, 12),
>>>>> c(2, 7, 12)))
>>>>> subject2<- GRanges(seqnames=rep('chr1',3),ranges=IRanges(c(3, 5, 12),
>>>>> c(3, 6, 12)))
>>>>> nearest(query2, subject2)
>>>>
>>>> [1] 1 1 3
>>>>>
>>>>> sessionInfo()
>>>>
>>>> R version 2.13.1 (2011-07-08)
>>>> Platform: x86_64-pc-linux-gnu (64-bit)
>>>>
>>>> locale:
>>>>   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>>>   [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>>>   [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
>>>>   [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>>>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>>
>>>> attached base packages:
>>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>>
>>>> other attached packages:
>>>> [1] GenomicRanges_1.4.8 IRanges_1.10.6
>>>>
>>>>
>>>> Is ignore.strand=TRUE intended to get the IRanges-like behaviour?
>>>> Because I have problems with this too:
>>>>
>>>>   nearest(x = query2, subject = subject2, ignore.strand=TRUE)
>>>> Error in strand(x)<- strand(subject)<- "+" : object 'x' not found
>>>>
>>>> Thanks!
>>>>
>>>> Jim
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>



More information about the Bioconductor mailing list