Martin Morgan
mtmorgan at fhcrc.org
Tue Sep 8 18:31:46 CEST 2009
Hi Robin --
Robin Hankin wrote:
> Hi
>
> I deal with long vectors almost all of whose elements are zero.
> Typically, the length will be ~5e7 with ~100 nonzero elements.
>
> I want to deal with these objects using a sort of sparse
> vector.
>
> The problem is that I want to be able to 'add' two such
> vectors.
> Toy problem follows. Suppose I have two such objects, 'a' and 'b':
The Bioconductor package IRanges has an Rle (run-length encoding) class
with arithmetic operations defined on it.
## once only, to install IRanges
source("http://bioconductor.org/biocLite.R")
biocLite("IRanges")
## load library
library(IRanges)
An Rle stores each run by its length rather than by its end position, so
your (index, value) pairs need converting first:
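ree2Rle <- function(ends, values)
{
    ## untested
    ## number of zeros preceding each nonzero element
    idx <- diff(c(0, ends)) - 1L
    ## interleave run lengths: a run of zeros, then a run of length 1
    len <- integer(2*length(idx))
    len[c(TRUE, FALSE)] <- idx
    len[c(FALSE, TRUE)] <- 1L
    ## matching run values: 0 for the zero runs, 'values' for the rest
    val <- vector(typeof(values), 2*length(idx))
    val[c(FALSE, TRUE)] <- values
    Rle(lengths=len, values=val)
}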
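The interleaving of run lengths in ree2Rle can be checked without IRanges. The following is a minimal base-R sketch, not part of the original reply: the inputs 'ends' and 'values' are small hypothetical examples, and base R's inverse.rle stands in for the Rle class purely to expand the runs into a dense vector.

```r
## Base-R sketch of the interleaving used in ree2Rle (no IRanges needed).
## 'ends' and 'values' are hypothetical inputs chosen for illustration.
ends   <- c(3, 5, 9)
values <- c(0.1, 2.2, 4.4)

gap <- diff(c(0, ends)) - 1L        # zeros preceding each nonzero element
len <- integer(2 * length(gap))
len[c(TRUE, FALSE)] <- gap          # odd slots: length of each zero run
len[c(FALSE, TRUE)] <- 1L           # even slots: one nonzero element
val <- vector(typeof(values), 2 * length(gap))
val[c(FALSE, TRUE)] <- values       # odd slots keep the default 0

## expand the runs with base R's inverse.rle and inspect the dense vector
dense <- inverse.rle(structure(list(lengths = len, values = val),
                               class = "rle"))
which(dense != 0)                   # positions of the nonzero elements
```

If the construction is right, the nonzero positions of 'dense' coincide with 'ends' and its total length equals the last end.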
Since we're adding vectors, and R has recycling rules, we create Rles
of the same length (by giving b an explicit 0 at the last position of a)
a <- ree2Rle(c(20,30, 10000000), c(2.2,3.3,4.4))
b <- ree2Rle(c(3, 30, length(a)), c(.1, .1, 0))
and then do the math
> system.time(abPlus <- a + b)
   user  system elapsed
  0.000   0.000   0.001
> abPlus
'numeric' Rle instance of length 10000000 with 8 runs
Lengths: 2 1 16 1 9 1 9999969 1
Values : 0 0.1 0 2.2 0 3.4 0 4.4
The ends of the nonzero runs are
> cumsum(runLength(abPlus))[runValue(abPlus) != 0]
[1] 3 20 30 10000000
and the values are
> runValue(abPlus)[runValue(abPlus) != 0]
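As a cross-check on the Rle result, the same sum can be computed directly on the (index, val) lists in base R. This is only a sketch under stated assumptions, not something from IRanges or from the reply above: addSparse, aList and bList are hypothetical names, and the helper assumes each input has no repeated indices.

```r
## Hypothetical helper (not from IRanges): add two sparse vectors stored
## as list(index=, val=). Assumes indices are unique within each input.
addSparse <- function(a, b) {
    index <- sort(unique(c(a$index, b$index)))   # union of positions
    val <- numeric(length(index))                # start from all zeros
    ia <- match(a$index, index)
    val[ia] <- val[ia] + a$val                   # add contributions from a
    ib <- match(b$index, index)
    val[ib] <- val[ib] + b$val                   # add contributions from b
    list(index = index, val = val)
}

aList <- list(index = c(20, 30, 10000000), val = c(2.2, 3.3, 4.4))
bList <- list(index = c(3, 30), val = c(0.1, 0.1))
ab <- addSparse(aList, bList)
ab$index   # union of the two index sets, sorted
ab$val     # summed values, with index 30 receiving 3.3 + 0.1
```

The work here scales with the number of nonzero elements rather than with the length of the vector. Entries that cancel to exactly zero are kept; drop them afterwards if a strictly sparse result is wanted.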
Martin
>
>
>
>> a
> $index
> [1] 20 30 100000000
>
> $val
> [1] 2.2 3.3 4.4
>
>
>
>> b
> $index
> [1] 3 30
>
> $val
> [1] 0.1 0.1
>
>>
>
>
> What I want is the "sum" of these:
>
>> AplusB
> $index
> [1] 3 20 30 100000000
>
> $val
> [1] 0.1 2.2 3.4 4.4
>
>>
>
>
> See how the value for index=30 (being common to both) is 3.4
> (=3.3+0.1). What's the best R idiom to achieve this?
>
>
>
