[R] What exactly is an dgCMatrix-class. There are so many attributes.
Martin Maechler
maechler at stat.math.ethz.ch
Sat Oct 21 19:13:45 CEST 2017
>>>>> David Winsemius <dwinsemius at comcast.net>
>>>>> on Sat, 21 Oct 2017 09:05:38 -0700 writes:
>> On Oct 21, 2017, at 7:50 AM, Martin Maechler <maechler at stat.math.ethz.ch> wrote:
>>
>>>>>>> C W <tmrsg11 at gmail.com>
>>>>>>> on Fri, 20 Oct 2017 15:51:16 -0400 writes:
>>
>>> Thank you for your responses. I guess I don't feel
>>> alone. I don't find that the documentation goes into any detail.
>>
>>> I also find it surprising that,
>>
>>>> object.size(train$data)
>>> 1730904 bytes
>>
>>>> object.size(as.matrix(train$data))
>>> 6575016 bytes
>>
>>> the dgCMatrix actually takes less memory, though it
>>> *looks* like the opposite.
>>
>> to whom?
>>
>> The whole idea of these sparse matrix classes in the 'Matrix'
>> package (and everywhere else in applied math, CS, ...) is that
>> 1. they need much less memory and
>> 2. matrix arithmetic with them can be much faster because it is based on
>> sophisticated sparse matrix linear algebra, notably the
>> sparse Cholesky decomposition for solve() etc.
>>
>> Of course the efficiency only applies if most of the
>> matrix entries _are_ 0.
>> You can measure the "sparsity" or rather the "density", of a
>> matrix by
>>
>> nnzero(A) / length(A)
>>
>> where length(A) == nrow(A) * ncol(A) as for regular matrices
>> (but it does *not* integer overflow)
>> and nnzero(.) is a simple utility from Matrix
>> which -- very efficiently for sparseMatrix objects -- gives the
>> number of nonzero entries of the matrix.
>>
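A minimal sketch of the density calculation described above, using a small hypothetical 4 x 5 matrix (the values and dimensions are made up for illustration and are not from the thread):

```r
library(Matrix)

# Hypothetical 4 x 5 sparse matrix with three nonzero entries.
A <- sparseMatrix(i = c(1, 3, 4), j = c(1, 2, 4),
                  x = c(2, 5, 7), dims = c(4, 5))

n_nonzero <- nnzero(A)   # number of nonzero entries
n_total   <- length(A)   # nrow(A) * ncol(A)
density   <- n_nonzero / n_total
density                  # 3 / 20 = 0.15
```

The closer this ratio is to 0, the more a sparse class should pay off in memory and speed.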
>> All of these classes are formally defined classes and have
>> therefore help pages. Here ?dgCMatrix-class which then points
>> to ?CsparseMatrix-class (and I forget if Rstudio really helps
>> you find these ..; in emacs ESS they are found nicely via the usual key)
>>
>> To get started, you may further look at ?Matrix _and_ ?sparseMatrix
>> (and possibly the Matrix package vignettes --- though they need
>> work -- I'm happy for collaborators there !)
>>
>> Bill Dunlap's comment applies indeed:
>> In principle all these matrices should work like regular numeric
>> matrices, just faster with less memory footprint if they are
>> really sparse (and not just formally of a sparseMatrix class)
>> ((and there are quite a few more niceties in the package))
>>
>> Martin Maechler
>> (here, maintainer of 'Matrix')
>>
>>
>>> On Fri, Oct 20, 2017 at 3:22 PM, David Winsemius <dwinsemius at comcast.net>
>>> wrote:
>>
>>>>> On Oct 20, 2017, at 11:11 AM, C W <tmrsg11 at gmail.com> wrote:
>>>>>
>>>>> Dear R list,
>>>>>
>>>>> I came across dgCMatrix. I believe this class is associated with sparse
>>>>> matrix.
>>>>
>>>> Yes. See:
>>>>
>>>> help('dgCMatrix-class', pack=Matrix)
>>>>
>>>> If Martin Maechler happens to respond to this you should listen to him
>>>> rather than anything I write. Much of what the Matrix package does appears
>>>> to be magical to one such as I.
>>>>
[............]
>>>>> data(agaricus.train, package='xgboost')
>>>>> train <- agaricus.train
>>>>> names( attributes(train$data) )
>>>> [1] "i" "p" "Dim" "Dimnames" "x" "factors"
>>>> "class"
>>>>> str(train$data)
>>>> Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
>>>> ..@ i : int [1:143286] 2 6 8 11 18 20 21 24 28 32 ...
>>>> ..@ p : int [1:127] 0 369 372 3306 5845 6489 6513 8380 8384 10991
>>>> ...
>>>> ..@ Dim : int [1:2] 6513 126
>>>> ..@ Dimnames:List of 2
>>>> .. ..$ : NULL
>>>> .. ..$ : chr [1:126] "cap-shape=bell" "cap-shape=conical"
>>>> "cap-shape=convex" "cap-shape=flat" ...
>>>> ..@ x : num [1:143286] 1 1 1 1 1 1 1 1 1 1 ...
>>>> ..@ factors : list()
>>>>
>>>>> Where is the data, is it in $p, $i, or $x?
>>>>
>>>> So the "data" (meaning the values of the sparse matrix) are in the @x
>>>> leaf. The values all appear to be the number 1. The @i leaf is the sequence
>>>> of row locations for the values entries while the @p items are somehow
>>>> connected with the columns (I think, since 127 and 126=number of columns
>>>> from the @Dim leaf are only off by 1).
>>
>> You are right David.
>>
>> well, they follow sparse matrix standards which (like C) start
>> counting at 0.
>>
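To make the role of the three slots concrete, here is a small sketch that rebuilds (row, column, value) triplets from @i, @p and @x. The 4 x 3 matrix is hypothetical; the slot names are the ones shown in the str() output above. Since @p holds 0-based column pointers, diff(p) gives the number of stored entries per column, which is why @p has ncol + 1 = 127 elements for the 126-column matrix in this thread.

```r
library(Matrix)

# Hypothetical 4 x 3 matrix:
#      [,1] [,2] [,3]
# [1,]   10    .    .
# [2,]    .    .   30
# [3,]   20    .    .
# [4,]    .    .   40
M <- sparseMatrix(i = c(1, 3, 2, 4), j = c(1, 1, 3, 3),
                  x = c(10, 20, 30, 40), dims = c(4, 3))

M@i  # 0-based row index of each stored entry, column by column
M@p  # 0-based column pointers, length ncol(M) + 1
M@x  # stored values, in the same order as @i

# diff(p) is the number of stored entries in each column;
# repeating each column number that many times gives one
# (1-based) column index per stored entry.
col <- rep(seq_len(ncol(M)), times = diff(M@p))
triplets <- data.frame(row = M@i + 1L, col = col, value = M@x)
triplets
```

In the agaricus example, the first differences of @p (369, 3, 2934, 2539, 644, ...) agree with the column sums shown further down, because every stored value is 1.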
>>>>
>>>> Doing this > colSums(as.matrix(train$data))
>>
>> The above colSums() again is "very" inefficient:
>> All such R functions have smartly defined Matrix methods that
>> directly work on sparse matrices.
> I did get an error with colSums(train$data)
>> colSums(train$data)
> Error in colSums(train$data) :
> 'x' must be an array of at least two dimensions
The same problem C.W. saw with head().
It, e.g., all works after calling str() on train$data.
But I am still puzzled, because head() is similar to str():
both are S3 generics (in "utils") but str()'s UseMethod() I
think sees that the class belongs to package "Matrix" and hence
attaches it {not just *loads* it -- hence, import etc does not matter},
but head() does not.
Even more curiously, colSums() *also* attaches Matrix but
still fails, but it works on a 2nd call
Example 1, in a fresh R session:
--------------------------------------------------------------------------------
> data(agaricus.train, package="xgboost")
> M <- agaricus.train$data
> methods(str)
[1] str.data.frame* str.Date* str.default* str.dendrogram* str.logLik* str.POSIXt*
# see '?methods' for accessing help and source code
> str(M)
Loading required package: Matrix <<<<<<<<< SEE ! <<<<<<<<<<<<<<<<<
Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
..@ i : int [1:143286] 2 6 8 11 18 20 21 24 28 32 ...
..@ p : int [1:127] 0 369 372 3306 5845 6489 6513 8380 8384 10991 ...
..@ Dim : int [1:2] 6513 126
..@ Dimnames:List of 2
.. ..$ : NULL
.. ..$ : chr [1:126] "cap-shape=bell" "cap-shape=conical" "cap-shape=convex" "cap-shape=flat" ...
..@ x : num [1:143286] 1 1 1 1 1 1 1 1 1 1 ...
..@ factors : list()
>
> head(M)
6 x 126 sparse Matrix of class "dgCMatrix"
[[ suppressing 126 column names ‘cap-shape=bell’, ‘cap-shape=conical’, ‘cap-shape=convex’ ... ]]
[1,] . . 1 . . . . . . 1 1 . . . . . . . . . 1 . . . . . . . . 1 . . . 1 . 1 . . . 1 1 . . . . . . . . . . . 1 . .
................
................
................
--------------------------------------------------------------------------------
See, str() is a nice generic function ==> it attaches Matrix (see
the message where I have added '<<<<<<<<< SEE ! <<<<.........'),
but, strangely, as we know head() does not.
Now, the curious colSums() behavior:
Example 2, in a fresh R session:
-----------------------------------------------------------------------------
> data(agaricus.train, package='xgboost')
> M <- agaricus.train$data
> cm <- colSums(M) ## first time, loads Matrix but then fails !!
Loading required package: Matrix
Error in colSums(M) : 'x' must be an array of at least two dimensions
> cm <- colSums(M) ## 2nd time, works because Matrix methods are all there
> str(cm)
Named num [1:126] 369 3 2934 2539 644 ...
- attr(*, "names")= chr [1:126] "cap-shape=bell" "cap-shape=conical" "cap-shape=convex" "cap-shape=flat" ...
>
-----------------------------------------------------------------------------
> Which as it turned out was due to my having not yet loaded pkg:Matrix. Perhaps the xgboost package only imports certain functions from pkg:Matrix and colSums is not one of them. This resembles the errors I get when I try to use grid package functions on ggplot2 objects. Since ggplot2 is built on top of grid I am always surprised when this happens, and after a headslap and explicitly loading pkg:grid I continue on my stumbling way.
> library(Matrix)
> colSums(train$data) # no error
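A hedged sketch of the two ways to make sure the sparse colSums() method is used. A small hypothetical matrix stands in for train$data so that xgboost is not needed. Whether an unqualified colSums() fails without Matrix attached may depend on the R and Matrix versions in use; the behaviour reported in this thread is from 2017.

```r
# Option 1: attach Matrix first, then call colSums() as usual.
library(Matrix)

# Hypothetical stand-in for train$data.
M <- sparseMatrix(i = c(1, 3, 2, 4), j = c(1, 1, 3, 3),
                  x = c(10, 20, 30, 40), dims = c(4, 3))

cs_attached <- colSums(M)

# Option 2: call the version exported by Matrix explicitly,
# which does not rely on the package being attached.
cs_qualified <- Matrix::colSums(M)

cs_attached   # 30 0 70
```

Either form avoids the round trip through as.matrix().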
>> Note that as.matrix(M) can "blow up" your R, when the matrix M
>> is really large and sparse such that its dense version does not
>> even fit in your computer's RAM.
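A rough way to check this before calling as.matrix(), sketched under the assumption that the dense result is a plain double matrix (8 bytes per entry, ignoring dimnames and object overhead). The helper name dense_bytes is made up for this illustration.

```r
library(Matrix)

# Approximate size in bytes of the dense double matrix with these dims.
# prod() returns a double, so this does not overflow the way an
# integer nrow * ncol product can.
dense_bytes <- function(dims) prod(dims) * 8

# The 6513 x 126 matrix from this thread:
dense_bytes(c(6513, 126))   # 6565104, close to the 6575016 bytes reported above

# A hypothetical very sparse 1e6 x 1e6 matrix: cheap to store as sparse,
# but its dense version would need about 8e12 bytes.
M <- sparseMatrix(i = 1:3, j = 1:3, x = c(1, 1, 1), dims = c(1e6, 1e6))
dense_bytes(dim(M))
```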
> I did know that, so I first calculated whether the dense matrix version of that object would fit in my RAM space and it fit easily so I proceeded.
> I find the TsparseMatrix indexing easier for my more naive notion of sparsity, although thinking about it now, I think I can see that the CsparseMatrix more closely resembles the "folded vector" design of dense R matrices. I will sometimes coerce CMatrix objects to TMatrix objects if I am working on the "inner" indices. I should probably stop doing that.
Well, it depends if speed and efficiency are the only important
issues.
The triplet representation (<==> TsparseMatrix) is of course
much easier to understand and explain than the column-compressed
one (CsparseMatrix) -- but the latter is the one that is
efficiently used in the C-level libraries for matrix
multiplication, Cholesky etc.
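For comparison, a small sketch of the two representations side by side, again with a hypothetical 4 x 3 matrix. The triplet form stores an explicit (0-based) row index @i and column index @j per entry, while the column-compressed form replaces @j with the pointer vector @p.

```r
library(Matrix)

# Hypothetical column-compressed matrix (sparseMatrix() returns one by default).
M <- sparseMatrix(i = c(1, 3, 2, 4), j = c(1, 1, 3, 3),
                  x = c(10, 20, 30, 40), dims = c(4, 3))

# Coerce to the triplet representation.
Tm <- as(M, "TsparseMatrix")
Tm@i  # 0-based row index per entry
Tm@j  # 0-based column index per entry
Tm@x  # values

# Coerce back to the column-compressed form used by the C-level routines.
Cm <- as(Tm, "CsparseMatrix")
all(as.matrix(Cm) == as.matrix(M))  # same matrix either way
```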
> I sincerely hope my stumbling efforts have not caused any delays.
Not at all, thank you David for all your helping on R-help !!!
Martin
> --
> David.
[..................]
> David Winsemius
> Alameda, CA, USA
> 'Any technology distinguishable from magic is insufficiently advanced.' -Gehm's Corollary to Clarke's Third Law
ok.... given your other statement, it may be that Matrix *is*
sufficiently advanced ;-) :-)