[R] Help needed to clarify hclust and cutree algorithms

Mon Sep 21 08:04:15 CEST 2009

Dear R Helpers,

I have carefully read the documentation and all postings on the hclust and cutree functions. However, some aspects of the tree ordering and cluster assignment performed by these functions remain unclear to me, so I would very much appreciate your help in making sure I get them right.

Here is an example, with values chosen to illustrate the problems.

I have a set of five profiles, one per variable (V1 to V5), each consisting of measurements in four different conditions (c1 to c4).

df = data.frame(rbind(c(-32, -52, -46, -35, -35), c(-86, -111, -101, -96, -105), c(17, 42, 36, 34, 37), c(24, 37, 28, 29, 30)))

colnames(df) = c("V1", "V2", "V3", "V4", "V5")
rownames(df) = c("c1", "c2", "c3", "c4")

> df
    V1   V2   V3  V4   V5
c1 -32  -52  -46 -35  -35
c2 -86 -111 -101 -96 -105
c3  17   42   36  34   37
c4  24   37   28  29   30

plot(df[,1], type="l", ylim=range(df))
points(df[1,1], type="p", pch=49)
for (i in 2:5) {
   points(df[,i], type="l", col=colors()[15*i])
   points(df[1,i], type="p", pch=48+i)
}

The task is to determine how correlated these profiles are and to partition them into two groups using hierarchical clustering.  Importantly, I need to output the order in which the variables occur in these clusters, from left to right in decreasing order of their correlation.  Because of this, the number assigned to each cluster (1 or 2) and the order in which the variables are listed within the clusters become very important.

For this I used the hclust and cutree functions:

cor.df =  cor(df, method="pearson")
dist.df = as.dist(1-cor.df)

hc.df = hclust(dist.df, method="complete")
hc.df.cl = cutree(hc.df, k=2)

> str(hc.df)
List of 7
 $ merge      : int [1:4, 1:2] -4 -2 -1 2 -5 -3 1 3
 $ height     : num [1:4] 0.00043 0.00048 0.004916 0.010176
 $ order      : int [1:5] 2 3 1 4 5
 $ labels     : chr [1:5] "V1" "V2" "V3" "V4" ...
 $ method     : chr "complete"
 $ call       : language hclust(d = dist.df, method = "complete")
 $ dist.method: NULL
 - attr(*, "class")= chr "hclust"

> hc.df.cl
V1 V2 V3 V4 V5 
 1  2  2  1  1 

> round(dist.df*1000, 2)
      V1    V2    V3    V4
V2 10.18                  
V3 10.11  0.48            
V4  4.42  3.74  2.27      
V5  4.92  6.61  4.33  0.43

plot(hc.df)

My questions are:

1.  Can I assume that plot(hc.df) and hc.df$order indicate that the order of merging was:

V2 V3 V1 V4 V5 ?

This does not seem to be supported by the distance matrix, which shows that the closest pair to begin with is V4-V5.

Also the element closest to V2 or V3 is V4, and not V1.

The hclust help states:
     In hierarchical cluster displays, a decision is needed at each
     merge to specify which subtree should go on the left and which on
     the right. Since, for n observations there are n-1 merges, there
     are 2^{(n-1)} possible orderings for the leaves in a cluster tree,
     or dendrogram. The algorithm used in 'hclust' is to order the
     subtree so that the tighter cluster is on the left (the last,
     i.e., most recent, merge of the left subtree is at a lower value
     than the last merge of the right subtree). Single observations are
     the tightest clusters possible, and merges involving two
     observations place them in order by their observation sequence
     number.

In this light, should I read the plot and $order as a flipped version of

V1 V4 V5 V3 V2  ?

Could somebody kindly show, step by step, how the merges are done?
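A minimal sketch of how the merge sequence could be read off the hclust object, rather than off hc.df$order. It relies on the documented meaning of the $merge and $height components: each row of $merge is one merge step, a negative entry is a single observation, and a positive entry refers to the cluster formed in that earlier row. The helper name describe_node and the variable merge_steps are hypothetical, introduced only for this illustration.

```r
# Rebuild the example so the block runs on its own.
df <- data.frame(rbind(c(-32, -52, -46, -35, -35),
                       c(-86, -111, -101, -96, -105),
                       c(17, 42, 36, 34, 37),
                       c(24, 37, 28, 29, 30)))
colnames(df) <- c("V1", "V2", "V3", "V4", "V5")
rownames(df) <- c("c1", "c2", "c3", "c4")
hc.df <- hclust(as.dist(1 - cor(df, method = "pearson")), method = "complete")

# describe_node is a hypothetical helper: a negative id is a single
# observation, a positive id is the cluster built at that earlier step.
describe_node <- function(hc, id) {
  if (id < 0) hc$labels[-id] else paste0("[step ", id, "]")
}

# One line of text per merge, in the order the merges were performed.
merge_steps <- vapply(seq_len(nrow(hc.df$merge)), function(i) {
  sprintf("step %d: %s + %s at height %.6f", i,
          describe_node(hc.df, hc.df$merge[i, 1]),
          describe_node(hc.df, hc.df$merge[i, 2]),
          hc.df$height[i])
}, character(1))
print(merge_steps)
```

Applied to the $merge matrix shown above, this reads as V4 + V5 first, then V2 + V3, then V1 joining the V4-V5 cluster, and finally the two remaining clusters joining. This would suggest that $order is the left-to-right leaf layout used for plotting and not the sequence of merges.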

2. When cutree cuts the tree into two clusters, which number does it assign to the cluster in which the profiles are most correlated?  Is the numbering simply from right to left of the tree as it appears in hc.df$order?
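One way to probe the numbering is to compare the cluster ids in the order of the input variables with the same ids read along the dendrogram. This is only a sketch on the example data; the names first_seen and along_tree are hypothetical. That cutree numbers clusters by first appearance among the input observations, rather than by tightness or tree position, is an assumption that should be checked against ?cutree.

```r
# Rebuild the example so the block runs on its own.
df <- data.frame(rbind(c(-32, -52, -46, -35, -35),
                       c(-86, -111, -101, -96, -105),
                       c(17, 42, 36, 34, 37),
                       c(24, 37, 28, 29, 30)))
colnames(df) <- c("V1", "V2", "V3", "V4", "V5")
hc.df <- hclust(as.dist(1 - cor(df, method = "pearson")), method = "complete")
hc.df.cl <- cutree(hc.df, k = 2)

# Cluster ids in the order they first appear among V1, V2, ..., V5.
first_seen <- unique(hc.df.cl)

# The same ids read left to right along the plotted dendrogram.
along_tree <- hc.df.cl[hc.df$order]

print(first_seen)
print(along_tree)
```

For this example the ids follow the input order (V1 is in cluster 1), while along the tree they read 2 2 1 1 1, so the numbering here matches neither a left-to-right nor a right-to-left pass over hc.df$order.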

3. If I take into account only the hc.df$order slot and the cluster number assigned by cutree 

> hc.df$order
[1] 2 3 1 4 5

> hc.df.cl
V1 V2 V3 V4 V5 
 1  2  2  1  1 

can I infer that the order of variables from left to right in decreasing order of correlation between profiles is:

variable V2 V3 V1 V4 V5
cluster  1  1  1  2  2

Is this correct?  It does not seem to be supported by the actual distance matrix. Even in reverse and with the cluster numbers flipped, the immediate neighbor of V4 should be V3 and not V1.

4. Most importantly, how could I use the results of these functions to output the following:

   A. The two clusters, labeled such that cluster 1 contains the pair of profiles with smallest distance from each other.

   B. The order of variables in decreasing order of correlation (increasing value of distance).  In this way the value listed after the last entry in cluster 1 will be the closest in distance to the members of cluster 1.

   Can I use only the results of these functions (and how), or do I need to do other data manipulation (and if so what exactly) to make sure the output complies with the requirements above?
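A rough sketch of one possible post-processing step for requirements A and B, assuming some manipulation beyond hclust and cutree is acceptable. It finds the closest pair in the distance matrix, relabels the cutree result so that the cluster holding that pair becomes cluster 1, and then sorts the variables by cluster and by their largest distance to the closest pair. That sorting rule is only one reading of requirement B, chosen here for illustration, and the names d.mat, closest.vars, relabelled, dist.to.pair and ordered.vars are all hypothetical.

```r
# Rebuild the example so the block runs on its own.
df <- data.frame(rbind(c(-32, -52, -46, -35, -35),
                       c(-86, -111, -101, -96, -105),
                       c(17, 42, 36, 34, 37),
                       c(24, 37, 28, 29, 30)))
colnames(df) <- c("V1", "V2", "V3", "V4", "V5")
dist.df <- as.dist(1 - cor(df, method = "pearson"))
hc.df <- hclust(dist.df, method = "complete")
hc.df.cl <- cutree(hc.df, k = 2)

# Locate the closest pair, ignoring the zero diagonal.
d.mat <- as.matrix(dist.df)
d.off <- d.mat
diag(d.off) <- Inf
closest <- which(d.off == min(d.off), arr.ind = TRUE)[1, ]
closest.vars <- rownames(d.mat)[closest]

# A: relabel so the cluster holding the closest pair is cluster 1.
tight.id <- hc.df.cl[closest.vars[1]]
relabelled <- ifelse(hc.df.cl == tight.id, 1L, 2L)

# B (assumed rule): within each cluster, sort by the largest distance
# to either member of the closest pair.
dist.to.pair <- apply(d.mat[, closest.vars], 1, max)
ordered.vars <- names(relabelled)[order(relabelled, dist.to.pair)]

print(relabelled)
print(ordered.vars)
```

Other rules for B, such as the mean distance to all members of cluster 1, would fit the same structure by changing only the dist.to.pair line.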

Thank you very much for your help in clarifying these issues.

> sessionInfo()
R version 2.8.1 (2008-12-22) 
i486-pc-linux-gnu 

With best regards,
Dana Sevak