[R] memory issue trying to solve too large a problem using hclust

cstrato at EUnet.at
Tue Dec 4 21:19:43 CET 2001

Dear Aboubakar

Thank you for your reply. I know that clustering is not a trivial
issue, which is why I thought I could start a discussion. It may not
seem to belong on r-help, but since many people (including me) use
R/S for expression profiling, I thought I would try it anyhow.

You mention a distance based on correlation; this was my
question #2: Could R/S also support it for hclust?
Since S/R are my main packages, it is a severe limitation to
have only a limited choice of metrics.
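
As a minimal sketch of one way this could be done: hclust() accepts any
dissimilarity object, so a correlation-based one can be built by hand
with cor() and as.dist(). The data below are simulated, the matrix
layout (genes in rows, arrays in columns) is an assumption, and the
choice of 1 - r and of average linkage is only one common option.

```r
# Hypothetical expression matrix: 20 genes (rows) x 8 arrays (columns)
set.seed(1)
x <- matrix(rnorm(20 * 8), nrow = 20,
            dimnames = list(paste0("gene", 1:20), paste0("array", 1:8)))

# cor() correlates columns, so transpose to correlate genes with each other
r <- cor(t(x))

# 1 - r lies between 0 (r = 1) and 2 (r = -1)
d <- as.dist(1 - r)

# Any linkage method can be used; "average" is chosen here for illustration
hc <- hclust(d, method = "average")
groups <- cutree(hc, k = 4)   # cut the tree into 4 groups
```

Note that 1 - r is a dissimilarity rather than a metric in the strict
sense; sqrt(2 * (1 - r)) is the Euclidean distance between standardized
profiles and could be used instead if a proper metric is required.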

You mention that k-means can have many solutions, but as far as
I know, the results of agglomerative hierarchical clustering
also depend on the order of the data? For this reason, one company
(Applied Maths) even calculates the significance of the branches
of a tree using bootstrap techniques. Could this possibly also be
done with R/S?
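
A rough sketch of what a bootstrap check could look like in R follows.
It is not the Applied Maths procedure, only an illustration under
stated assumptions: the data are simulated, the arrays (columns) are
resampled with replacement, and support is computed for the clusters
obtained at a fixed cut (k = 3) rather than for every branch of the
tree. The helper cluster_genes() is hypothetical.

```r
set.seed(2)
# Hypothetical data: 3 groups of 5 genes, each group sharing a profile
# across 10 arrays, plus noise
base <- rbind(sin(1:10), cos(1:10), (1:10) / 5 - 1)
x <- base[rep(1:3, each = 5), ] + matrix(rnorm(15 * 10, sd = 0.3), 15)
rownames(x) <- paste0("gene", 1:15)

# Hypothetical helper: cluster genes with a correlation-based dissimilarity
cluster_genes <- function(m, k) {
  cutree(hclust(as.dist(1 - cor(t(m))), method = "average"), k = k)
}

k <- 3
ref <- cluster_genes(x, k)      # clusters from the original data

B <- 200                        # number of bootstrap replicates
hits <- numeric(k)
for (b in seq_len(B)) {
  cols <- sample(ncol(x), replace = TRUE)   # resample arrays
  boot <- cluster_genes(x[, cols], k)
  for (j in seq_len(k)) {
    members <- ref == j
    lab <- unique(boot[members])
    # Cluster j is recovered if its members share one label
    # and no other gene carries that label
    if (length(lab) == 1 && !any(boot[!members] == lab)) {
      hits[j] <- hits[j] + 1
    }
  }
}

support <- hits / B   # fraction of replicates recovering each cluster
print(round(support, 2))
```

A cluster with support close to 1 is reproduced in nearly every
replicate; low support suggests that the grouping depends on a few
arrays.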

Furthermore, if I remember correctly, someone mentioned that
divisive hierarchical clustering would be preferable to agglomerative
clustering, but that no algorithm exists to compute it in a reasonable
time. (Could it be that this was mentioned by Prof. Ripley?)

Quite some time ago I tried the different cluster algorithms
and metrics available in S/R, and at that time DIANA seemed to give
the best results. It is a pity that more recent cluster algorithms
such as CURE etc. (see question #4) are not implemented, so that
it is not possible to try them and compare them with the ones
currently in use.

(BTW, mclust seems to give especially bad results, but I do not
know why.)

Personally, I would prefer to have a function that clusters the
data using several different cluster algorithms and then identifies
those branches of a tree that always end up in the same sub-cluster;
these could then be considered "stable".
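
The idea could be sketched roughly as follows, using only functions
from the standard stats package. The data are simulated, the set of
methods and the fixed number of clusters (k = 3) are arbitrary
assumptions, and "stable" is taken here to mean "placed in the same
cluster by every method"; other definitions are possible.

```r
set.seed(3)
# Hypothetical data: 3 groups of 5 genes across 10 arrays, plus noise
base <- rbind(sin(1:10), cos(1:10), (1:10) / 5 - 1)
x <- base[rep(1:3, each = 5), ] + matrix(rnorm(15 * 10, sd = 0.3), 15)
rownames(x) <- paste0("gene", 1:15)

k <- 3
d <- dist(x)   # Euclidean distance between gene profiles

# Cluster labels from several methods (the selection is arbitrary)
labelings <- list(
  average  = cutree(hclust(d, "average"), k),
  complete = cutree(hclust(d, "complete"), k),
  single   = cutree(hclust(d, "single"), k),
  kmeans   = kmeans(x, centers = k, nstart = 10)$cluster
)

# Two genes are grouped together by every method exactly when their
# label tuples across all methods are identical
key <- do.call(paste, unname(labelings))
stable <- split(rownames(x), key)

# Each element of 'stable' is a set of genes that no method separates
print(unname(stable))
```

Genes that end up alone in their element of `stable` are those on which
the methods disagree. Other algorithms (for example diana() from the
cluster package) could be added to `labelings` in the same way.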

Best regards
Christian Stratowa

Aboubakar Maitournam wrote:

> I am not a famous statistician (so I will walk on eggs), but I know that
> clustering is not a trivial task and is not completely solved. The most
> widely used technique for clustering gene expression data is hierarchical
> clustering, which depends on the choice of distance. There is some
> consensus in favour of a distance based on correlation (take care, because
> sometimes it is not a distance in the strict topological sense, that is,
> in the sense of a metric space). In addition, hierarchical clustering is
> sensitive to noise. But owing to phylogenetic practice and the pioneering
> work of Eisen, hierarchical clustering is the most widely used clustering
> technique in gene expression data analysis.
> K-means, like hierarchical clustering, involves arbitrary choices and can
> give many solutions.
> The methods still under theoretical development, which estimate the number
> of clusters in the data and determine the corresponding classes, are based
> on mixture models, as in the package mclust, or on some published work
> based on simulated annealing.
> But naturally it is difficult to change "les habitudes" (the usual
> practices), and perhaps the stochastic background on which these methods
> are based, which is not poetic, explains why they are not used.
> Finally, if you want to use the classical methods (PCA, k-means,
> hierarchical clustering), the best approach is to try at least two of them.
> Note that there are also non-classical methods based on graph theory or
> neural networks, but the objective methods remain PCA and stochastic
> methods.
> Aboubakar Maitournam.
