[BioC] Problem with nearest() in GenomicRanges
Chris Whelan
whelanch at ohsu.edu
Tue Jul 31 05:17:48 CEST 2012
Hi Valerie,
I get this behavior using only a small subset of my data - attached. I am loading it like this:
x.data <- read.table("x_data.txt", header=T, sep="\t")
cpg.island.data <- read.table('cpg_islands_head.gff')
In doing some more testing, it seems like I get this error if not all of the seqnames in the query GRanges are present in the subject GRanges. Is that data not supported?
I tried setting the levels with:
seqlevels(xs) <- seqlevels(cpg.islands)
But then I just get back a vector of NA's from nearest().
Thanks,
Chris
On Jul 30, 2012, at 7:08 PM, Valerie Obenchain wrote:
> Hi Chris,
>
> It's difficult to know what's going on without a reproducible example.
> Can you attach a small subset of your GRanges that still triggers the
> error? If not, is there a place I can get the data for testing?
>
> Valerie
>
> On 07/28/12 12:03, Chris Whelan wrote:
>> Hi all,
>>
>> I'm having an issue with the nearest() method in GenomicRanges. I've
>> created GRanges from two data frames, from data for a species genome
>> assembly with lots of scaffolds:
>>
>>> xs<- with(x.data, GRanges(seqnames=Scaffold, ranges=IRanges(start=Start, end=Stop), strand=Strand))
>>> xs
>> GRanges with 989 ranges and 0 elementMetadata cols:
>> seqnames ranges strand
>> <Rle> <IRanges> <Rle>
>> [1] GL397480 [ 1477346, 1477813] -
>> [2] GL397446 [ 869823, 870295] +
>> ... ... ... ...
>> [988] GL397464 [1138154, 1138427] -
>> [989] GL397464 [1138154, 1138427] -
>> ---
>> seqlengths:
>> ADFV01135143 ADFV01136404 GL397262 ... GL397832 GL397836
>> NA NA NA ... NA NA
>>
>>> cpg.islands<- with(cpg.island.data, GRanges(seqnames=V1, ranges=IRanges(start=V4, end=V5), strand='*'))
>>> cpg.islands
>> GRanges with 88968 ranges and 0 elementMetadata cols:
>> seqnames ranges strand
>> <Rle> <IRanges> <Rle>
>> [1] GL397261 [ 4409, 5062] *
>> [2] GL397261 [ 66765, 67010] *
>> ... ... ... ...
>> [88967] ADFV01162706 [1834, 2052] *
>> [88968] ADFV01162751 [ 226, 884] *
>> ---
>> seqlengths:
>> ADFV01134425 ADFV01134436 ADFV01134442 ... GL405215 GL405216
>> NA NA NA ... NA NA
>>
>> And I'd like to use nearest to find the closest island to each x:
>>
>>> nearest(xs, cpg.islands)
>> Error in Rle(values = callGeneric(runValue(e1)[which1],
>> runValue(e2)[which2]), :
>> error in evaluating the argument 'values' in selecting a method for
>> function 'Rle': Error in Ops.factor(runValue(e1)[which1],
>> runValue(e2)[which2]) :
>> level sets of factors are different
>>> traceback()
>> 12: Rle(values = callGeneric(runValue(e1)[which1], runValue(e2)[which2]),
>> lengths = diffWithInitialZero(ends), check = FALSE)
>> 11: seqnames(x) != seqnames(y)
>> 10: seqnames(x) != seqnames(y)
>> 9: `seqselect<-`(`*tmp*`, seqnames(x) != seqnames(y), value = NA)
>> 8: `seqselect<-`(`*tmp*`, seqnames(x) != seqnames(y), value = NA)
>> 7: .local(x, y, ...)
>> 6: distance(x[midx], subject[p[midx]])
>> 5: distance(x[midx], subject[p[midx]])
>> 4: .nearest(x, subject, ignore.strand, ...)
>> 3: .local(x, subject, ...)
>> 2: nearest(xs, cpg.islands)
>> 1: nearest(xs, cpg.islands)
>>
>> This same data works fine with other methods like findOverlaps():
>>
>>> findOverlaps(xs, cpg.islands)
>> Hits of length 225
>> queryLength: 989
>> subjectLength: 88968
>> queryHits subjectHits
>> <integer> <integer>
>> 1 2 75585
>> 2 6 36857
>> ... ... ...
>> 224 963 84872
>> 225 964 84872
>>
>> Any idea what I might be doing wrong? Here's my sessionInfo():
>>
>>> sessionInfo()
>> R version 2.15.1 (2012-06-22)
>> Platform: x86_64-pc-linux-gnu (64-bit)
>>
>> locale:
>> [1] LC_CTYPE=C LC_NUMERIC=C
>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
>> [5] LC_MONETARY=C LC_MESSAGES=C
>> [7] LC_PAPER=C LC_NAME=C
>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats graphics grDevices utils datasets methods base
>>
>> other attached packages:
>> [1] locfit_1.5-8 GenomicFeatures_1.8.2 AnnotationDbi_1.18.1
>> [4] Biobase_2.16.0 multicore_0.1-7 rtracklayer_1.16.3
>> [7] GenomicRanges_1.8.9 IRanges_1.14.4 BiocGenerics_0.2.0
>> [10] reshape2_1.2.1 ggplot2_0.9.1 BiocInstaller_1.4.7
>>
>> loaded via a namespace (and not attached):
>> [1] BSgenome_1.24.0 Biostrings_2.24.1 DBI_0.2-5 MASS_7.3-19
>> [5] RColorBrewer_1.0-5 RCurl_1.91-1 RSQLite_0.11.1 Rsamtools_1.8.5
>> [9] XML_3.9-4 biomaRt_2.12.0 bitops_1.0-4.1 colorspace_1.1-1
>> [13] compiler_2.15.1 dichromat_1.2-4 digest_0.5.2 grid_2.15.1
>> [17] labeling_0.1 lattice_0.20-6 memoise_0.1 munsell_0.3
>> [21] plyr_1.7.1 proto_0.3-9.2 scales_0.2.1 stats4_2.15.1
>> [25] stringr_0.6.1 tools_2.15.1 zlibbioc_1.2.0
>>
>> Thanks in advance,
>>
>> Chris
>>
>> [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
More information about the Bioconductor
mailing list