[BioC] Problem with nearest() in GenomicRanges

Chris Whelan whelanch at ohsu.edu
Tue Jul 31 05:17:48 CEST 2012


Hi Valerie,

I get this behavior using only a small subset of my data - attached. I am loading it like this:

x.data <- read.table("x_data.txt", header=T, sep="\t")
cpg.island.data <- read.table('cpg_islands_head.gff')

In doing some more testing, it seems like I get this error if not all of the seqnames in the query GRanges are present in the subject GRanges. Is that data not supported?

I tried setting the levels with:

seqlevels(xs) <- seqlevels(cpg.islands)

But then I just get back a vector of NA's from nearest().

Thanks,

Chris




On Jul 30, 2012, at 7:08 PM, Valerie Obenchain wrote:

> Hi Chris,
>
> It's difficult to know what's going on without a reproducible example.
> Can you attach a small subset of your GRanges that still triggers the
> error? If not, is there a place I can get the data for testing?
>
> Valerie
>
> On 07/28/12 12:03, Chris Whelan wrote:
>> Hi all,
>>
>> I'm having an issue with the nearest() method in GenomicRanges. I've
>> created GRanges from two data frames, from data for a species genome
>> assembly with lots of scaffolds:
>>
>>> xs<- with(x.data, GRanges(seqnames=Scaffold, ranges=IRanges(start=Start, end=Stop), strand=Strand))
>>> xs
>> GRanges with 989 ranges and 0 elementMetadata cols:
>>         seqnames               ranges strand
>>            <Rle>             <IRanges>   <Rle>
>>     [1] GL397480 [ 1477346,  1477813]      -
>>     [2] GL397446 [  869823,   870295]      +
>>     ...      ...                  ...    ...
>>   [988] GL397464   [1138154, 1138427]      -
>>   [989] GL397464   [1138154, 1138427]      -
>>   ---
>>   seqlengths:
>>    ADFV01135143 ADFV01136404     GL397262 ...     GL397832     GL397836
>>              NA           NA           NA ...           NA           NA
>>
>>> cpg.islands<- with(cpg.island.data, GRanges(seqnames=V1, ranges=IRanges(start=V4, end=V5), strand='*'))
>>> cpg.islands
>> GRanges with 88968 ranges and 0 elementMetadata cols:
>>               seqnames           ranges strand
>>                  <Rle>         <IRanges>   <Rle>
>>       [1]     GL397261 [  4409,   5062]      *
>>       [2]     GL397261 [ 66765,  67010]      *
>>       ...          ...              ...    ...
>>   [88967] ADFV01162706     [1834, 2052]      *
>>   [88968] ADFV01162751     [ 226,  884]      *
>>   ---
>>   seqlengths:
>>    ADFV01134425 ADFV01134436 ADFV01134442 ...     GL405215     GL405216
>>              NA           NA           NA ...           NA           NA
>>
>> And I'd like to use nearest to find the closest island to each x:
>>
>>> nearest(xs, cpg.islands)
>> Error in Rle(values = callGeneric(runValue(e1)[which1],
>> runValue(e2)[which2]),  :
>>   error in evaluating the argument 'values' in selecting a method for
>> function 'Rle': Error in Ops.factor(runValue(e1)[which1],
>> runValue(e2)[which2]) :
>>   level sets of factors are different
>>> traceback()
>> 12: Rle(values = callGeneric(runValue(e1)[which1], runValue(e2)[which2]),
>>         lengths = diffWithInitialZero(ends), check = FALSE)
>> 11: seqnames(x) != seqnames(y)
>> 10: seqnames(x) != seqnames(y)
>> 9: `seqselect<-`(`*tmp*`, seqnames(x) != seqnames(y), value = NA)
>> 8: `seqselect<-`(`*tmp*`, seqnames(x) != seqnames(y), value = NA)
>> 7: .local(x, y, ...)
>> 6: distance(x[midx], subject[p[midx]])
>> 5: distance(x[midx], subject[p[midx]])
>> 4: .nearest(x, subject, ignore.strand, ...)
>> 3: .local(x, subject, ...)
>> 2: nearest(xs, cpg.islands)
>> 1: nearest(xs, cpg.islands)
>>
>> This same data works fine with other methods like findOverlaps():
>>
>>> findOverlaps(xs, cpg.islands)
>> Hits of length 225
>> queryLength: 989
>> subjectLength: 88968
>>     queryHits subjectHits
>>      <integer>    <integer>
>>  1           2       75585
>>  2           6       36857
>>  ...       ...         ...
>>  224       963       84872
>>  225       964       84872
>>
>> Any idea what I might be doing wrong? Here's my sessionInfo():
>>
>>> sessionInfo()
>> R version 2.15.1 (2012-06-22)
>> Platform: x86_64-pc-linux-gnu (64-bit)
>>
>> locale:
>>  [1] LC_CTYPE=C                 LC_NUMERIC=C
>>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C
>>  [5] LC_MONETARY=C              LC_MESSAGES=C
>>  [7] LC_PAPER=C                 LC_NAME=C
>>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>
>> other attached packages:
>>  [1] locfit_1.5-8          GenomicFeatures_1.8.2 AnnotationDbi_1.18.1
>>  [4] Biobase_2.16.0        multicore_0.1-7       rtracklayer_1.16.3
>>  [7] GenomicRanges_1.8.9   IRanges_1.14.4        BiocGenerics_0.2.0
>> [10] reshape2_1.2.1        ggplot2_0.9.1         BiocInstaller_1.4.7
>>
>> loaded via a namespace (and not attached):
>>  [1] BSgenome_1.24.0    Biostrings_2.24.1  DBI_0.2-5          MASS_7.3-19
>>  [5] RColorBrewer_1.0-5 RCurl_1.91-1       RSQLite_0.11.1     Rsamtools_1.8.5
>>  [9] XML_3.9-4          biomaRt_2.12.0     bitops_1.0-4.1     colorspace_1.1-1
>> [13] compiler_2.15.1    dichromat_1.2-4    digest_0.5.2       grid_2.15.1
>> [17] labeling_0.1       lattice_0.20-6     memoise_0.1        munsell_0.3
>> [21] plyr_1.7.1         proto_0.3-9.2      scales_0.2.1       stats4_2.15.1
>> [25] stringr_0.6.1      tools_2.15.1       zlibbioc_1.2.0
>>
>> Thanks in advance,
>>
>> Chris
>>
>>       [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>



More information about the Bioconductor mailing list