[R] define number of clusters in kmeans/apcluster analysis
Ulrich Bodenhofer
bodenhofer at bioinf.jku.at
Tue Dec 15 12:57:23 CET 2015
Dear Luigi,
As the others have replied already, you cannot expect a clustering
algorithm to produce exactly the result that you expect intuitively. The
results of clustering algorithms depend largely on the parameters and,
even more importantly, on the distance/similarity measure that is used.
k-means, for instance, uses the Euclidean distance. As a result, it
works nicely for spherical clusters that have approximately the same
radius. APCluster, unless you choose a different similarity, uses
negative squared distances, which leads to very similar properties. Your
data set consists of two clusters, one of which is much more spread out.
That some parts of the larger cluster are being assigned to the other
cluster looks weird, but it is perfectly explained by the properties of
the algorithms. There is a lot of literature about the properties of
clustering algorithms around. That's my 2 cents about this. In your
case, however, as already pointed out in Bill Dunlap's reply, the
scaling is the more important issue. k-means and apcluster do not
perform any scaling of the data. Your two axes differ strongly in terms
of scaling. Enter the following to see how the two clustering algorithms
"see" your data (i.e. with two equally scaled axes):
plot(z, xlim=c(0, 50), ylim=c(0, 50))
Given this, it is no longer surprising that both algorithms split the
data in the way they do.
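To make the effect of scaling concrete, here is a small sketch in base R. The data set z_demo is a made-up stand-in (your actual z is not reproduced here), so the numbers are only illustrative. It measures how much of the total squared Euclidean distance comes from each axis before and after scale():

```r
## z_demo is hypothetical data, NOT the z from the original question
set.seed(1)
z_demo <- cbind(x = c(rnorm(50, 10, 1), rnorm(50, 40, 1)),          # spans tens of units
                y = c(rnorm(50, 0.2, 0.05), rnorm(50, 0.8, 0.05)))  # spans less than one unit

## fraction of the summed squared pairwise distances contributed by column "x"
share_x <- function(d) {
  dx <- sum(dist(d[, "x"])^2)
  dy <- sum(dist(d[, "y"])^2)
  dx / (dx + dy)
}

share_raw    <- share_x(z_demo)         # close to 1: the x axis dominates
share_scaled <- share_x(scale(z_demo))  # 0.5 up to rounding: both axes weigh equally
print(c(raw = share_raw, scaled = share_scaled))
```

On the raw data the distances are determined almost entirely by the axis with the larger range, which is what k-means and apcluster get to work with if you do not re-scale.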
Actually, if you re-scale the data, apcluster produces the result you
expect:
z2 <- scale(z)
m <- apclusterK(negDistMat(r=2), z2, K=2, verbose=TRUE)
plot(m, z2)
## it even works to superimpose the clustering result on the original data
plot(m, z)
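For comparison, the same re-scaling can be tried with kmeans() from base R. This is only a sketch on made-up data (z_demo and truth are hypothetical, not your z), and whether k-means gives the grouping you expect on your own data is something you would have to check:

```r
## hypothetical data: the axis with the large scale ("x") carries no group
## information, the axis with the small scale ("y") separates the two groups
set.seed(42)
truth  <- rep(1:2, each = 100)
z_demo <- cbind(x = rnorm(200, mean = 0, sd = 100),
                y = rnorm(200, mean = c(0, 1)[truth], sd = 0.05))

## share of points on which a 2-cluster labelling agrees with the known groups
## (cluster numbers are arbitrary, so take the better of the two matchings)
agreement <- function(cl) max(mean(cl == truth), mean(cl != truth))

km_raw    <- kmeans(z_demo,        centers = 2, nstart = 25)
km_scaled <- kmeans(scale(z_demo), centers = 2, nstart = 25)

print(c(raw = agreement(km_raw$cluster), scaled = agreement(km_scaled$cluster)))

## as with apcluster, the result can be drawn on the original data
plot(z_demo, col = km_scaled$cluster)
```

The nstart argument only restarts k-means from several random initialisations and keeps the best run; it does not replace the re-scaling step.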
I hope that helps.
Best regards,
Ulrich