[R] memory-efficient column aggregation of a sparse matrix

roger koenker rkoenker at uiuc.edu
Thu Feb 1 14:30:48 CET 2007

Doug is right, I think, that this would be easier with full indexing
using the  matrix.coo classe, if you want to use SparseM.  But
then the tapply seems to be the way to go.

url:    www.econ.uiuc.edu/~roger            Roger Koenker
email    rkoenker at uiuc.edu            Department of Economics
vox:     217-333-4558                University of Illinois
fax:       217-244-6678                Champaign, IL 61820

On Feb 1, 2007, at 7:22 AM, Douglas Bates wrote:

> On 1/31/07, Jon Stearley <jrstear at sandia.gov> wrote:
>> I need to sum the columns of a sparse matrix according to a factor -
>> ie given a sparse matrix X and a factor fac of length ncol(X), sum
>> the elements by column factors and return the sparse matrix Y of size
>> nrow(X) by nlevels(f).  The appended code does the job, but is
>> unacceptably memory-bound because tapply() uses a non-sparse
>> representation.  Can anyone suggest a more memory and cpu efficient
>> approach?  Eg, a sparse matrix tapply method?  Thanks.
> This is the sort of operation that is much more easily performed in
> the triplet representation of a sparse matrix where each nonzero
> element is represented by its row index, column index and value.
> Using that representation you could map the column indices according
> to the factor then convert back to one of the other representations.
> The only question would be what to do about nonzeros in different
> columns of the original matrix that get mapped to the same element in
> the result.  It turns out that in the sparse matrix code used by the
> Matrix package the triplet representation allows for duplicate index
> positions with the convention that the resulting value at a position
> is the sum of the values of any triplets with that index pair.
> If you decide to use this approach please be aware that the indices
> for the triplet representation in the Matrix package are 0-based (as
> in C code) not 1-based (as in R code).  (I imagine that Martin is
> thinking "we really should change that" as he reads this part.)
>> --
>> +--------------------------------------------------------------+
>> | Jon Stearley                  (505) 845-7571  (FAX 844-9297) |
>> | Sandia National Laboratories  Scalable Systems Integration   |
>> +--------------------------------------------------------------+
>> # x and y are of SparseM class matrix.csr
>> "aggregate.csr" <-
>> function(x, fac) {
>>          # make a vector indicating the row of each nonzero
>>          rows <- integer(length=length(x at ra))
>>          rows[x at ia[1:nrow(x)]] <- 1 # put a 1 at start of each row
>>          rows <- as.integer(cumsum(rows)) # and finish with a cumsum
>>          # make a vector indicating the column factor of each nonzero
>>          f <- fac[x at ja]
>>          # aggregate by row,f
>>          y <- tapply(x at ra, list(rows,f), sum)
>>          # sparsify it
>>          y[is.na(y)] <- 0  # change tapply NAs to as.matrix.csr 0s
>>          y <- as.matrix.csr(y)
>>          y
>> }

More information about the R-help mailing list