[R] memory issue trying to solve too large a problem using hclust

Tue Dec 4 10:32:07 CET 2001

"cstrato at EUnet.at" wrote:

> Hi all, hi Matthew
>
> I would like to extend this question and take the opportunity
> to ask all the famous statisticians in this group for advice.
>
> First a personal comment :-)
> I am quite amused, how easy it is sometimes to find out on
> which project someone writing to this group is working:
> You mention that you want to cluster 12,500 objects. If I am
> correct, you are trying to cluster the 12,500 genes on the
> human Affymetrix GeneChip HgU95A, correct?
> (At least this is what I am just trying to do)
>
> Now to the questions, which I wanted to ask for quite some time:
>
> Since the time of the paper:
> Eisen MB, Spellman PT, Brown PO, Botstein D.
> Cluster analysis and display of genome-wide expression patterns.
> Proc Natl Acad Sci U S A. 1998 Dec 8;95(25):14863-8.
> most biologists working on gene expression use hierarchical
> clustering to cluster all genes they have on their DNA-chips.
> Next year we will see chips containing more than 20,000 genes
> on one chip.
>
> Thus the question is:
> 1, What is the best way to cluster this number of genes?
> Sometimes, I have heard, you should first use k-means to
> divide the genes into few subclusters, and use hierarchical
> clustering for the subclusters only. Is this correct?
>
> 2, When you do hierarchical clustering, what metric would
> be best to use?
> M. Eisen's paper describes Pearson correlation as the metric.
> Is there a way to implement this metric for use in hclust?
> Unfortunately, hclust supports only euclid and manhattan.
>
> 3, R/S contain some other cluster algorithms such as CLARA,
> PAM, FANNY, AGNES. However, I have never seen any paper on
> expression profiling using these algorithms. Is there a special
> reason, why these functions are not used?
>
> 4, Meanwhile, new methods for cluster analysis have been
> developed. For example, the book "Data Mining" of Han&Kamber
> mentions BIRCH, CURE, DBSCAN, OPTICS, DENCLUE, STING
> as some of these new algorithms.
> Would it make sense to use one of these methods?
> Does someone know if implementations of these functions
> do exist?
>
> 5, As I understand, there does not exist a single "best" cluster
> algorithm for this purpose, but you have to try different methods,
> and try to find out which one describes the data best.
> This is often easy when you cluster samples, but is hard to
> find out when trying to cluster 20,000 or even more genes.
>
> 6, Do there exist better methods other than clustering, which
> could group genes with similar behavior?
> PCA may be one method, but is based on dimensionality reduction,
> which may not be applicable in many cases?
>
> I know, that in this group questions to cluster many data have
> partly been answered, but I have the feeling, that many of these
> questions remain open, especially, when applied to expression
> profiling.
>
> I also know that many people working in this field use R/S
> as their main tool, so any help would be appreciated not only
> from me.
>
> Best regards
> Christian Stratowa
> ----------------------------------
> C.h.r.i.s.t.i.a.n  S.t.r.a.t.o.w.a
> V.i.e.n.n.a,  A.u.s.t.r.i.a
>
> "Wiener, Matthew" wrote:
>
> > Hi, all.
> >
> > I'm trying to cluster 12,500 objects using hclust from package mva.  The
> > distance matrix takes up nearly 600 MB.  The distance matrix also needs to
> > be copied when being passed to the fortran routine that actually does the
> > clustering (it's modified during the clustering), so that's 1200 MB.  I'm
> > actually on a machine with 2.5 GB of memory (and nothing else running), so I
> > thought I could pull this off.  The routine quits with the error "cannot
> > allocate a vector of size 609131 KB", which by its size seems to be another
> > copy of the distance matrix, I think the one needed by the fortran routine.
> > As far as I can tell from looking at the code, no additional objects of the
> > size of the distance matrix are used.
> >
> > After the error gc() says that the garbage collection threshold is 1433 MB.
> >
> > I'm wondering whether some additional copies of the distance matrix are
> > being made, and whether I could somehow stop them from being made.  Any
> > other suggestions for how I could get around the memory problem would also
> > be appreciated.  (I know of clara in the "cluster" package, but would like
> > to use hierarchical methods.)
> >
> > The function hierclust in multiv seems to demand even more memory, even when
> > bign = T.
> >
> > I am running R-1.3.1 on Sun OS 5.6.
> >
> > Thanks for any help.
> >
> > Matthew Wiener
> > Applied Computer Science and Mathematics Department
> > Merck Research Labs
> > Rahway, NJ  07065-0900
> > 732-594-5303
> >
> > -.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
> > r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
> > Send "info", "help", or "[un]subscribe"
> > (in the "body", not the subject !)  To: r-help-request at stat.math.ethz.ch
> > _._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
>

I am not a famous statistician (so I will tread carefully), but I know that
clustering is not a trivial problem and is not completely solved. The most
widely used technique for clustering gene expression data is hierarchical
clustering, whose result depends on the choice of distance. There is some
consensus on using a distance based on correlation (take care, because such
a measure is sometimes not a distance in the strict topological sense, that
is, in the sense of a metric space). In addition, hierarchical clustering is
sensitive to noise. But because of its link to phylogenetic practice and the
pioneering work of Eisen, hierarchical clustering is the most widely used
technique for clustering in gene expression data analysis.
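
As a rough illustration of the correlation-based distance discussed above,
here is a minimal sketch in R. It assumes a small simulated matrix in place
of real expression data, and the names x, d_cor and d_met are hypothetical.
hclust accepts any "dist" object, so the distance does not have to come from
dist(). In current R, hclust, cor and as.dist are in the stats package.

```r
# Simulated stand-in for an expression matrix: rows = genes, columns = samples.
set.seed(1)
x <- matrix(rnorm(200 * 10), nrow = 200, ncol = 10)

# cor() correlates columns, so transpose to get gene-by-gene correlations.
r <- cor(t(x))

# 1 - r is a dissimilarity in [0, 2], but it is not a metric in general:
# it can violate the triangle inequality.
d_cor <- as.dist(1 - r)

# sqrt(2 * (1 - r)) equals the Euclidean distance between genes after each
# profile is centred and scaled to unit length, so it is a true metric.
d_met <- as.dist(sqrt(2 * (1 - r)))

# Average linkage on the correlation dissimilarity, then cut into 5 groups
# (the number 5 is arbitrary here).
hc <- hclust(d_cor, method = "average")
groups <- cutree(hc, k = 5)
table(groups)
```

Whether to use 1 - r or the square-root form is a judgment call; the second
is the safer choice when a method relies on the triangle inequality.
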
K-means, like hierarchical clustering, involves arbitrary choices and can
give many different solutions. The methods under theoretical development,
which estimate the number of clusters in the data and determine the
corresponding classes, are based on mixture models, as in the package
mclust, or on simulated annealing, as in some published work. But naturally
it is difficult to change "les habitudes" (the usual practices), and perhaps
the stochastic background on which these methods are based, which is not
very poetic, explains why they are not used.
Finally, if you want to use the classical methods (PCA, k-means,
hierarchical clustering), the best approach is to try at least two of them.
Note that there are also non-classical methods based on graph theory or
neural networks, but the objective methods remain PCA and the stochastic
methods.
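
One way to combine two of the classical methods, along the lines of the
k-means-first idea raised in the question, is sketched below in R. It is
only an illustration on simulated data: the sizes n, p and k, the helper
dist_mb and the choice of 5 final groups are all assumptions, not values
from this thread. The point is that the distance matrix is then built on the
k cluster centres rather than on all n objects, which is what makes the
memory demand manageable.

```r
set.seed(2)
n <- 2000   # number of objects (genes); kept small so the sketch runs fast
p <- 10     # number of samples
x <- matrix(rnorm(n * p), nrow = n)

# Approximate storage for the lower-triangle distances of n objects,
# at 8 bytes per double, in megabytes.
dist_mb <- function(n) n * (n - 1) / 2 * 8 / 2^20
dist_mb(12500)  # roughly 596 MB, in line with the "nearly 600 MB" reported

# Stage 1: k-means splits the objects into k coarse groups.
k <- 50
km <- kmeans(x, centers = k, iter.max = 50, nstart = 3)

# Stage 2: hierarchical clustering of the k centres only, so the distance
# matrix has k * (k - 1) / 2 entries instead of n * (n - 1) / 2.
hc <- hclust(dist(km$centers), method = "average")
top <- cutree(hc, k = 5)

# Map every object to the top-level group of its k-means cluster.
gene_group <- unname(top[km$cluster])
table(gene_group)
```

The result still depends on the arbitrary choices in both stages (k, the
linkage, the random starts), so it is worth comparing against a second
method as suggested above.
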

Aboubakar Maitournam.
