[BioC] subdivideGRanges() to a given number of ranges
Martin Morgan
mtmorgan at fhcrc.org
Thu Sep 6 22:33:40 CEST 2012
On 09/06/2012 11:33 AM, d r wrote:
> Thank you Mike and Valarie,
>
> Just to be sure I understood you clearly: I understand there is no
> possibilty to use subdivideGRanges() to split a GRanges to a given number
> of elements.
> If I understood you correctly, it is also not possible to read a vector
> into the subsize argument. is taht true or is ther some possibilty to do
> that?
Not sure if this is the best way, but for a GRanges instance 'gr'
dividing it in to 'n' bins, you might start by making sure there won't
be any zero-width or necessarily overlapping bins
if (any(width(gr) < n))
stop("all 'width(gr)' must be >= 'n'")
then figure out the width, in numeric values, of each bin
d <- width(gr) / n
To make a vector of offsets from the start of each original bin, you
want to do this in a 'vectorized' fashion, rather than a loop, so I
replicated each element of d n times, calculated the cumulative sum, and
then subtracted for each of the d x n elements the value of the
cumulative sum at the start of each replicate, a little obscure I guess...
dd <- cumsum(rep(d, each=n))
mask <- logical(n); mask[1] <- TRUE # every n'th is base for range
dd <- dd - rep(dd[mask], each=n)
To get the new starts, replicate the original starts and add the
offsets, using round() to approximate to the nearest integer value
starts <- round(rep(start(gr), each=n) + dd)
the ends will be the starts of the next range, minus 1 (the 'L' says to
represent 0 and 1 as integer values, rather than numeric, and avoids
unnecessary coercion from integer to numeric and back)
ends <- c(starts[-1] - 1L, 0L)
except for each of the n'th (including last) elements, which will be the
end of the original range
ends[rev(mask)] <- end(gr)
we can then replicate our original data and update the ranges
gr <- gr[rep(seq_along(gr), each=n)]
ranges(gr) <- IRanges(starts, ends)
getting the result
gr
Here's the above as a function
intoNbins <-
function(gr, n = 10)
{
if (any(width(gr) < n))
stop("all 'width(gr)' must be >= 'n'")
d <- width(gr) / n
dd <- cumsum(rep(d, each=n))
mask <- logical(n); mask[1] <- TRUE
dd <- dd - rep(dd[mask], each=n)
starts <- round(rep(start(gr), each=n) + dd)
ends <- c(starts[-1], 0) - 1L
ends[rev(mask)] <- end(gr)
gr <- gr[rep(seq_along(gr), each=n)]
ranges(gr) <- IRanges(starts, ends)
gr
}
and in action
> gr <- GRanges("A", IRanges(c(1, 21, 21), width=c(10, 20, 30)))
> table(width(intoNbins(gr, 10)))
1 2 3
10 10 10
It might make sense to split the GRanges into a GRangesList, one element
for each CpG island, replacing the last line of intoNbins with
split(gr, rep(seq_along(d), each=n))
Martin
>
> thanks again
> Dolev
>
> On Thu, Sep 6, 2012 at 5:17 PM, Michael Love <love at molgen.mpg.de> wrote:
>
>> Hi Dolev,
>>
>> The subdivideGRanges() function cannot accomplish this task.
>>
>> from the manual:
>>
>> "Takes an input GRanges object and, splits each range into multiple ranges
>> of nearly equal width. For an input range of width w and subdividing size
>> s, it will subdivide the range into max(1,floor(w/s)) nearly equal width
>> ranges."
>>
>> So the function tries to keep the widths of the subdivided ranges around s
>> basepairs, rather than keeping the number of subdivided ranges constant.
>>
>> I will include a more helpful error message if a vector is provided for
>> the subsize argument.
>>
>> cheers,
>>
>> Mike
>>
>>
>>
>> On 09/06/12 16:02, Valerie Obenchain wrote:
>>
>>> Hi Dolev,
>>>
>>> I am copying the maintainer of exomeCopy which is the package
>>> subdivideGRanges() comes from. Please provide a small reproducable example
>>> of the problem you are having (i.e., a few lines of code that someone else
>>> can run to get the same error). This makes it easier for others to help.
>>>
>>> Valerie
>>>
>>> On 09/06/12 04:39, d r wrote:
>>>
>>>> Hello
>>>>
>>>> I have a GRanges object representing CpG islands, that I want to
>>>> subdivide
>>>> so that each island will be represented by 10 ranges of equal width.
>>>>
>>>> To do this, I created a vector containing the widths of the islands
>>>> divided
>>>> by ten:
>>>>
>>>> swidth<-as.integer(width(**islandshg19)/10)
>>>>
>>>>
>>>>
>>>> and then called subdivideGRanges():
>>>>
>>>>
>>>> islandshg19<-subdivideGRanges(**islandshg19,
>>>> subsize=as.integer(width/10))
>>>>
>>>>
>>>> and got a multitude of warnings.
>>>>
>>>>
>>>> This approach did work when I tried it on a GRanges object containing a
>>>> single range, and a width vector of length 1. When I tried to extend the
>>>> sample to include two objetcs, I got these warning message:
>>>>
>>>> 1: In if (widths[i]< 2 * subsize) { ... :
>>>>
>>>> the condition has length> 1 and only the first element will be used
>>>>
>>>> 2: In if (width< 2 * subsize) { ... :
>>>>
>>>> the condition has length> 1 and only the first element will be used
>>>>
>>>> 3: In 0:(nchunks - 1) : numerical expression has 2 elements: only the
>>>> first used
>>>>
>>>>
>>>>
>>>> Which makes me think that the problem may be related to assigning the
>>>> right
>>>> subszie to the the rigth range.
>>>>
>>>> Does anyone have an idea how to do that?
>>>>
>>>> Thanks in advance
>>>>
>>>> Dolev Rahat
>>>>
>>>>
>>>>
>>>> sessionInfo(if required):
>>>>
>>>> R version 2.15.1 (2012-06-22)
>>>>
>>>> Platform: x86_64-pc-mingw32/x64 (64-bit)
>>>>
>>>>
>>>>
>>>> locale:
>>>>
>>>> [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United
>>>> States.1252 LC_MONETARY=English_United States.1252
>>>>
>>>> [4] LC_NUMERIC=C LC_TIME=English_United
>>>> States.1252
>>>>
>>>>
>>>>
>>>> attached base packages:
>>>>
>>>> [1] stats graphics grDevices utils datasets methods base
>>>>
>>>>
>>>>
>>>> other attached packages:
>>>>
>>>> [1] exomeCopy_1.2.0 Rsamtools_1.8.6 Biostrings_2.24.1
>>>> rtracklayer_1.16.3 taRifx_1.0.4 reshape2_1.2.1
>>>> plyr_1.7.1
>>>>
>>>> [8] XML_3.9-4.1 GenomicRanges_1.9.54 IRanges_1.15.40
>>>> BiocGenerics_0.3.1 BiocInstaller_1.4.7
>>>>
>>>>
>>>>
>>>> loaded via a namespace (and not attached):
>>>>
>>>> [1] bitops_1.0-4.1 BSgenome_1.25.7 RCurl_1.91-1.1 stats4_2.15.1
>>>> stringr_0.6.1 tools_2.15.1 zlibbioc_1.2.0
>>>>
>>>> [[alternative HTML version deleted]]
>>>>
>>>> ______________________________**_________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https://stat.ethz.ch/mailman/listinfo/bioconductor>
>>>> Search the archives: http://news.gmane.org/gmane.**
>>>> science.biology.informatics.**conductor<http://news.gmane.org/gmane.science.biology.informatics.conductor>
>>>>
>>>
>>>
>>
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793
More information about the Bioconductor
mailing list