[R] memory-efficient column aggregation of a sparse matrix

Thu Feb 1 14:22:07 CET 2007

On 1/31/07, Jon Stearley <jrstear at sandia.gov> wrote:
> I need to sum the columns of a sparse matrix according to a factor -
> ie given a sparse matrix X and a factor fac of length ncol(X), sum
> the elements by column factors and return the sparse matrix Y of size
> nrow(X) by nlevels(f).  The appended code does the job, but is
> unacceptably memory-bound because tapply() uses a non-sparse
> representation.  Can anyone suggest a more memory and cpu efficient
> approach?  Eg, a sparse matrix tapply method?  Thanks.

This is the sort of operation that is much more easily performed in
the triplet representation of a sparse matrix where each nonzero
element is represented by its row index, column index and value.
Using that representation you could map the column indices according
to the factor then convert back to one of the other representations.
The only question would be what to do about nonzeros in different
columns of the original matrix that get mapped to the same element in
the result.  It turns out that in the sparse matrix code used by the
Matrix package the triplet representation allows for duplicate index
positions with the convention that the resulting value at a position
is the sum of the values of any triplets with that index pair.

If you decide to use this approach please be aware that the indices
for the triplet representation in the Matrix package are 0-based (as
in C code) not 1-based (as in R code).  (I imagine that Martin is
thinking "we really should change that" as he reads this part.)

>
> --
> +--------------------------------------------------------------+
> | Jon Stearley                  (505) 845-7571  (FAX 844-9297) |
> | Sandia National Laboratories  Scalable Systems Integration   |
> +--------------------------------------------------------------+
>
>
> # x and y are of SparseM class matrix.csr
> "aggregate.csr" <-
> function(x, fac) {
>          # make a vector indicating the row of each nonzero
>          rows <- integer(length=length(x at ra))
>          rows[x at ia[1:nrow(x)]] <- 1 # put a 1 at start of each row
>          rows <- as.integer(cumsum(rows)) # and finish with a cumsum
>
>          # make a vector indicating the column factor of each nonzero
>          f <- fac[x at ja]
>
>          # aggregate by row,f
>          y <- tapply(x at ra, list(rows,f), sum)
>
>          # sparsify it
>          y[is.na(y)] <- 0  # change tapply NAs to as.matrix.csr 0s
>          y <- as.matrix.csr(y)
>
>          y
> }
>