[R] What exactly is a dgCMatrix-class? There are so many attributes.

Martin Maechler maechler at stat.math.ethz.ch
Sat Oct 21 19:13:45 CEST 2017


>>>>> David Winsemius <dwinsemius at comcast.net>
>>>>>     on Sat, 21 Oct 2017 09:05:38 -0700 writes:

    >> On Oct 21, 2017, at 7:50 AM, Martin Maechler <maechler at stat.math.ethz.ch> wrote:
    >> 
    >>>>>>> C W <tmrsg11 at gmail.com>
    >>>>>>> on Fri, 20 Oct 2017 15:51:16 -0400 writes:
    >> 
    >>> Thank you for your responses.  I guess I don't feel
    >>> alone. I don't find that the documentation goes into any detail.
    >> 
    >>> I also find it surprising that,
    >> 
    >>>> object.size(train$data)
    >>> 1730904 bytes
    >> 
    >>>> object.size(as.matrix(train$data))
    >>> 6575016 bytes
    >> 
    >>> the dgCMatrix actually takes less memory, though it
    >>> *looks* like the opposite.
    >> 
    >> to whom?
    >> 
    >> The whole idea of these sparse matrix classes in the 'Matrix'
    >> package (and everywhere else in applied math, CS, ...) is that
    >> 1. they need  much less memory   and
    >> 2. matrix arithmetic with them can be much faster because it is based on
    >> sophisticated sparse matrix linear algebra, notably the
    >> sparse Cholesky decomposition for solve() etc.
    >> 
    >> Of course the efficiency only applies if most of the
    >> matrix entries _are_ 0.
    >> You can measure the  "sparsity" or rather the  "density", of a
    >> matrix by
    >> 
    >> nnzero(A) / length(A)
    >> 
    >> where length(A) == nrow(A) * ncol(A)  as for regular matrices
    >> (but it does *not* overflow the integer range)
    >> and nnzero(.) is a simple utility from Matrix
    >> which -- very efficiently for sparseMatrix objects -- gives the
    >> number of nonzero entries of the matrix.
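
[A minimal sketch of the density measure described above. The 1000 x 100 matrix `A` is made up for illustration and is not the xgboost data; only sparseMatrix(), nnzero(), length() and object.size() are used.]

```r
library(Matrix)

# Hypothetical example: 1000 x 100 matrix with exactly one nonzero per row,
# i.e. 1000 nonzero entries out of 100000.
A <- sparseMatrix(i = 1:1000,
                  j = rep(1:100, times = 10),
                  x = rep(1, 1000),
                  dims = c(1000, 100))

n_nonzero <- nnzero(A)            # number of nonzero entries
n_total   <- length(A)            # nrow(A) * ncol(A)
density   <- n_nonzero / n_total  # fraction of entries that are nonzero

# Memory of the sparse object versus its dense counterpart
size_sparse <- object.size(A)
size_dense  <- object.size(as.matrix(A))
```

With a density of 0.01, the sparse object should be far smaller than the dense one; the exact byte counts depend on the R and Matrix versions.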
    >> 
    >> All of these classes are formally defined classes and
    >> therefore have help pages.  Here  ?dgCMatrix-class  which then points
    >> to  ?CsparseMatrix-class  (and I forget if RStudio really helps
    >> you find these ..; in Emacs ESS they are found nicely via the usual key)
    >> 
    >> To get started, you may further look at  ?Matrix _and_  ?sparseMatrix
    >> (and possibly the Matrix package vignettes --- though they need
    >> work -- I'm happy for collaborators there !)
    >> 
    >> Bill Dunlap's comment applies indeed:
    >> In principle all these matrices should work like regular numeric
    >> matrices, just faster with less memory footprint if they are
    >> really sparse (and not just formally of a sparseMatrix class)
    >> ((and there are quite a few more niceties in the package))
    >> 
    >> Martin Maechler
    >> (here, maintainer of 'Matrix')
    >> 
    >> 
    >>> On Fri, Oct 20, 2017 at 3:22 PM, David Winsemius <dwinsemius at comcast.net>
    >>> wrote:
    >> 
    >>>>> On Oct 20, 2017, at 11:11 AM, C W <tmrsg11 at gmail.com> wrote:
    >>>>> 
    >>>>> Dear R list,
    >>>>> 
    >>>>> I came across dgCMatrix. I believe this class is associated with sparse
    >>>>> matrix.
    >>>> 
    >>>> Yes. See:
    >>>> 
    >>>> help('dgCMatrix-class', package = 'Matrix')
    >>>> 
    >>>> If Martin Maechler happens to respond to this you should listen to him
    >>>> rather than anything I write. Much of what the Matrix package does appears
    >>>> to be magical to one such as I.
    >>>> 
[............]    

    >>>>> data(agaricus.train, package='xgboost')
    >>>>> train <- agaricus.train
    >>>>> names( attributes(train$data) )
    >>>> [1] "i"        "p"        "Dim"      "Dimnames" "x"        "factors"
    >>>> "class"
    >>>>> str(train$data)
    >>>> Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
    >>>> ..@ i       : int [1:143286] 2 6 8 11 18 20 21 24 28 32 ...
    >>>> ..@ p       : int [1:127] 0 369 372 3306 5845 6489 6513 8380 8384 10991
    >>>> ...
    >>>> ..@ Dim     : int [1:2] 6513 126
    >>>> ..@ Dimnames:List of 2
    >>>> .. ..$ : NULL
    >>>> .. ..$ : chr [1:126] "cap-shape=bell" "cap-shape=conical"
    >>>> "cap-shape=convex" "cap-shape=flat" ...
    >>>> ..@ x       : num [1:143286] 1 1 1 1 1 1 1 1 1 1 ...
    >>>> ..@ factors : list()
    >>>> 
    >>>>> Where is the data, is it in $p, $i, or $x?
    >>>> 
    >>>> So the "data" (meaning the values of the sparse matrix) are in the @x
    >>>> leaf. The values all appear to be the number 1. The @i leaf is the sequence
    >>>> of row locations for the value entries while the @p items are somehow
    >>>> connected with the columns (I think, since 127 and 126=number of columns
    >>>> from the @Dim leaf are only off by 1).
    >> 
    >> You are right David.
    >> 
    >> well, they follow sparse matrix standards which (like C) start
    >> counting at 0.
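
[A small sketch of how the @i, @p and @x slots fit together, assuming a made-up 3 x 3 matrix. The helper col_entries() is hypothetical and written only for this illustration; it is not part of Matrix.]

```r
library(Matrix)

# Hypothetical 3 x 3 example:
#      [,1] [,2] [,3]
# [1,]   10    .    .
# [2,]    .    .   30
# [3,]   20    .   40
M <- sparseMatrix(i = c(1, 3, 2, 3),
                  j = c(1, 1, 3, 3),
                  x = c(10, 20, 30, 40),
                  dims = c(3, 3))

M@x  # the stored values, column by column
M@i  # 0-based row index of each stored value
M@p  # 0-based column pointers, length ncol(M) + 1

# Hypothetical helper: the entries of (1-based) column j sit at
# positions p[j] + 1, ..., p[j + 1] of @i and @x.
col_entries <- function(M, j) {
  idx <- seq_len(M@p[j + 1] - M@p[j]) + M@p[j]
  data.frame(row = M@i[idx] + 1L, value = M@x[idx])
}

col3 <- col_entries(M, 3)   # rows 2 and 3, values 30 and 40
col2 <- col_entries(M, 2)   # empty column: zero rows

# Number of stored entries per column
per_col <- diff(M@p)
```

This is why @p has one more element than there are columns (127 versus 126 above): consecutive differences of @p give the number of stored entries per column, so the quoted str() output says the first column has 369 entries and the second has 3.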
    >> 
    >>>> 
    >>>> Doing this > colSums(as.matrix(train$data))
    >> 
    >> The above colSums() again is "very" inefficient:
    >> All such R functions  have smartly defined  Matrix methods that
    >> directly work on sparse matrices.
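
[A minimal sketch of the point about colSums(), assuming Matrix is attached and using a made-up 3 x 3 matrix instead of train$data.]

```r
library(Matrix)

# Hypothetical small sparse matrix (not the xgboost data)
M <- sparseMatrix(i = c(1, 3, 2, 3),
                  j = c(1, 1, 3, 3),
                  x = c(10, 20, 30, 40),
                  dims = c(3, 3))

# Matrix method: works on the sparse representation directly
cs_sparse <- colSums(M)

# Dense detour: first allocates the full nrow x ncol matrix
cs_dense <- colSums(as.matrix(M))
```

Both calls give the same numbers; the difference is only that the second one needs memory for every zero as well.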

    > I did get an error with colSums(train$data)

    >> colSums(train$data)
    > Error in colSums(train$data) : 
    > 'x' must be an array of at least two dimensions

The same problem  C.W. saw with head()

E.g., it all works after calling  str()  on train$data.

But I am still puzzled, because head() is similar to str():
both are S3 generics (in "utils"),  but str()'s UseMethod(), I
think, sees that the class belongs to package "Matrix" and hence
attaches it {not just *load* it -- hence, import etc. does not matter},
but  head() does not.

Even more curiously,  colSums()  *also*  attaches Matrix but
still fails on the first call; it only works on a 2nd call.

Example 1, in a fresh R session:

--------------------------------------------------------------------------------

> data(agaricus.train, package="xgboost")
> M <- agaricus.train$data
> methods(str)

[1] str.data.frame* str.Date*       str.default*    str.dendrogram* str.logLik*     str.POSIXt*    
# see '?methods' for accessing help and source code

> str(M)
Loading required package: Matrix  <<<<<<<<< SEE ! <<<<<<<<<<<<<<<<<
Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
  ..@ i       : int [1:143286] 2 6 8 11 18 20 21 24 28 32 ...
  ..@ p       : int [1:127] 0 369 372 3306 5845 6489 6513 8380 8384 10991 ...
  ..@ Dim     : int [1:2] 6513 126
  ..@ Dimnames:List of 2
  .. ..$ : NULL
  .. ..$ : chr [1:126] "cap-shape=bell" "cap-shape=conical" "cap-shape=convex" "cap-shape=flat" ...
  ..@ x       : num [1:143286] 1 1 1 1 1 1 1 1 1 1 ...
  ..@ factors : list()
> 
> head(M)
6 x 126 sparse Matrix of class "dgCMatrix"
   [[ suppressing 126 column names ‘cap-shape=bell’, ‘cap-shape=conical’, ‘cap-shape=convex’ ... ]]
[1,] . . 1 . . . . . . 1 1 . . . . . . . . . 1 . . . . . . . . 1 . . . 1 . 1 . . . 1 1 . . . . . . . . . . . 1 . .
................
................
................

--------------------------------------------------------------------------------

See, str()  is a nice generic function ==> it attaches Matrix (see
the message where I have added '<<<<<<<<< SEE ! <<<<.........'),
but, strangely, as we know,  head() does not.

Now, the curious  colSums() behavior:

Example 2, in a fresh R session:
-----------------------------------------------------------------------------
> data(agaricus.train, package='xgboost')
> M <- agaricus.train$data
> cm <- colSums(M) ## first time, loads Matrix but then fails !!
Loading required package: Matrix
Error in colSums(M) : 'x' must be an array of at least two dimensions
> cm <- colSums(M) ## 2nd time, works because Matrix methods are all there
> str(cm)
 Named num [1:126] 369 3 2934 2539 644 ...
 - attr(*, "names")= chr [1:126] "cap-shape=bell" "cap-shape=conical" "cap-shape=convex" "cap-shape=flat" ...
> 
-----------------------------------------------------------------------------

    > Which as it turned out was due to my having not yet loaded pkg:Matrix. Perhaps the xgboost package only imports certain functions from pkg:Matrix and that colSums is not one of them. This resembles the errors I get when I try to use grid package functions on ggplot2 objects. Since ggplot2 is built on top of grid I always am surprised when this happens and after a headslap and explicitly loading pkg:grid I continue on my stumbling way.


    > library(Matrix)
    > colSums(train$data)   # no error


    >> Note that  as.matrix(M)  can "blow up" your R, when the matrix M
    >> is really large and sparse such that its dense version does not
    >> even fit in your computer's RAM.
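
[A rough sketch of how one might check beforehand whether the dense version would fit, assuming 8 bytes per double and ignoring dimnames and object headers. The 1e5 x 1e5 matrix is made up for illustration.]

```r
library(Matrix)

# Hypothetical large, very sparse matrix: 1e5 x 1e5 with two nonzeros
M <- sparseMatrix(i = c(1, 1e5),
                  j = c(1, 1e5),
                  x = c(1, 1),
                  dims = c(1e5, 1e5))

# Estimated size of as.matrix(M) without actually creating it:
# one double (8 bytes) per entry, zeros included.
dense_bytes <- prod(dim(M)) * 8
dense_gib   <- dense_bytes / 2^30   # about 74.5 GiB

# Actual size of the sparse object, for comparison
sparse_bytes <- as.numeric(object.size(M))
```

For the 6513 x 126 matrix in this thread the same estimate gives 6513 * 126 * 8 = 6565104 bytes, close to the 6575016 bytes reported by object.size(as.matrix(train$data)) above; the remainder is presumably the dimnames and the object header.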

    > I did know that, so I first calculated whether the dense matrix version of that object would fit in my RAM space and it fit easily so I proceeded. 

    > I find the TsparseMatrix indexing easier for my more naive notion of sparsity, although thinking about it now,  I think I can see that the CsparseMatrix more closely resembles the "folded vector" design of dense R matrices. I will sometimes coerce CMatrix objects to TMatrix objects if I am working on the "inner" indices. I should probably stop doing that.

Well, it depends on whether speed and efficiency are the only
important issues.
The triplet representation (<==> TsparseMatrix)  is of course
much easier to understand and explain than the column-compressed
one (CsparseMatrix) -- but the latter is the one that is
efficiently used in the C-level libraries for matrix
multiplication, Cholesky etc.
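
[A minimal sketch comparing the two representations on a made-up 3 x 3 matrix, assuming Matrix is attached. The coercion uses as(., "TsparseMatrix"); the variable names are arbitrary.]

```r
library(Matrix)

# Hypothetical column-compressed matrix (CsparseMatrix)
Cm <- sparseMatrix(i = c(1, 3, 2, 3),
                   j = c(1, 1, 3, 3),
                   x = c(10, 20, 30, 40),
                   dims = c(3, 3))

# Triplet form: one explicit (i, j, x) triple per stored entry, 0-based
Tm <- as(Cm, "TsparseMatrix")
Tm@i  # row indices
Tm@j  # column indices
Tm@x  # values

# The column indices of the triplet form can be recovered from the
# column pointers @p of the compressed form.
j_from_p <- rep(seq_len(ncol(Cm)), times = diff(Cm@p)) - 1L
```

The triplet form stores one column index per entry, whereas the compressed form stores only ncol + 1 pointers, which is part of why the latter is the one used by the C-level code.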


    > I sincerely hope my stumbling efforts have not caused any delays.

Not at all,  thank you David for all your help on R-help !!!
Martin

    > -- 
    > David.

    [..................]


    > David Winsemius
    > Alameda, CA, USA

    > 'Any technology distinguishable from magic is insufficiently advanced.'   -Gehm's Corollary to Clarke's Third Law

ok.... given your other statement,  it may be that  Matrix  *is*
sufficiently advanced  ;-) :-)


