[R] Trouble with (Very) Simple Clustering
David L Carlson
dcarlson at tamu.edu
Mon Jun 6 18:54:28 CEST 2016
I think your problem is that pvclust looks for clusters between variables and you have only one variable. When you transpose data_mat, you have a single row and dist cannot calculate a distance matrix on a single row:
> dist(t(data_mat))
dist(0)
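To make the shape problem concrete, here is a minimal sketch in base R. It uses a small made-up one-column matrix `x` as a stand-in for data_mat (the values and row names are illustrative, not your data):

```r
# A small one-column matrix standing in for data_mat (illustrative values)
x <- matrix(c(50.1, 48.2, 30.4, 8.4, 1.3), ncol = 1,
            dimnames = list(c("a", "b", "c", "d", "e"), NULL))

dim(x)      # 5 rows (observations), 1 column (variable)
dim(t(x))   # 1 row, 5 columns after transposing

d_rows   <- dist(x)      # distances between the 5 observations
d_single <- dist(t(x))   # only one row, so there are no pairs to compare

length(d_rows)     # 10 pairwise distances (5 * 4 / 2)
length(d_single)   # 0, which prints as dist(0)
```

dist() always works on rows, so the orientation of the matrix decides what gets compared.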
I was going to suggest package NbClust since there is no need to transpose the data, but it fails as well. I did discover that Mclust() in package mclust works:
> library(mclust)
> Mclust(data_mat)
'Mclust' model object:
best model: univariate, unequal variance (V) with 3 components
The density plot suggests 3 groups as well:
> plot(density(data_mat))
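If you want something less subjective than eyeballing the plot, one rough option is to count the local maxima of the density estimate. The sketch below is only that, a sketch: it uses your values rounded to two decimals, the default kernel and bandwidth of density(), and names of my own choosing (`peak_idx`, `peaks`). The number of peaks depends on the bandwidth, so it need not agree with the number of components Mclust reports.

```r
# Values from the original post, rounded to two decimals
x <- c(50.14, 48.23, 30.38, 29.27, 25.51, 22.91, 21.39, 15.57, 15.32,
       8.37, 7.24, 6.51, 5.43, 4.62, 4.34, 3.90, 3.67, 2.33, 1.88,
       1.62, 1.56, 1.50, 1.32, 1.28, 1.27, 0.73, 0.33, 0.32)

d <- density(x)   # default Gaussian kernel and bandwidth

# A local maximum is where the slope changes from positive to negative
peak_idx <- which(diff(sign(diff(d$y))) == -2) + 1
peaks <- data.frame(location = d$x[peak_idx], height = d$y[peak_idx])
peaks
```

Trying a few values of the `adjust` argument to density() shows how sensitive the peak count is to smoothing.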
-------------------------------------
David L Carlson
Department of Anthropology
Texas A&M University
College Station, TX 77840-4352
-----Original Message-----
From: R-help [mailto:r-help-bounces at r-project.org] On Behalf Of Lorenzo Isella
Sent: Monday, June 6, 2016 10:08 AM
To: r-help at r-project.org
Subject: [R] Trouble with (Very) Simple Clustering
Dear All,
I am doing something extremely basic (and I do not claim at all there
is no other way to achieve the same): I have a list of numbers and I
would like to split them up into clusters.
This is what I do: I see each number as a 1D vector and I calculate
the euclidean distance between them.
I get a distance matrix which I then feed to a hierarchical clustering
algorithm.
For instance consider the following snippet
#########################################################
data_mat<-structure(c(50.1361524639595, 48.2314746179241, 30.3803078462882,
29.2679787220381, 25.5125237513957, 22.9052912406594,
21.3890604699407,
15.5680557012965, 15.322981489303, 8.36693180374788, 7.23530025890675,
6.51469907237986, 5.42861828441895, 4.61986804112007,
4.33660782487196,
3.89915821225882, 3.67394875259037, 2.32719820674605,
1.88489249113792,
1.62276579528843, 1.56048239182126, 1.49722163565454,
1.32492151010636,
1.28216249552147, 1.272235253501, 0.734274800585336,
0.326949583587343,
0.318777047947951), .Dim = c(28L, 1L), .Dimnames = list(c("EE",
"LV", "RO", "BG", "SK", "CY", "LT", "MT", "PL", "NL", "EL", "PT",
"CZ", "SE", "UK", "LU", "HR", "DK", "AT", "SI", "IE", "ES", "FI",
"FR", "DE", "IT", "HU", "BE"), NULL))
distMatrix <- dist(data_mat)
n_clus<-5 ## I arbitrarily choose to have 5 clusters
hc <- hclust(distMatrix , method="ward.D2")
groups <- cutree(hc, k=n_clus) # cut tree into 5 clusters
pdf("cluster1.pdf")
plot(hc, hang = -1, main="Mobility to Business",
yaxt='n' , ann=FALSE
)
rect.hclust(hc, k=n_clus, border="red")
dev.off()
######################################################
which gives me very reasonable results.
Now, I would like to be able to find the optimal number of clusters on
the same data.
Based on what I found
http://www.sigmath.es.osaka-u.ac.jp/shimo-lab/prog/pvclust/
http://www.statmethods.net/advstats/cluster.html
pvclust is a sensible way to go. However, when I try to use it on my
data, I get an error
> fit <- pvclust(t(data_mat),
+ method.hclust="ward.D2", method.dist="euclidean")
Error in FUN(X[[i]], ...) : invalid scale parameter(r)
Does anybody understand what my mistake is?
Many thanks
Lorenzo