[R] How a clustering algorithm in R can end up with negative silhouette values?

Martin Maechler maechler at stat.math.ethz.ch
Mon Feb 22 16:48:39 CET 2016


>>>>> Sarah Goslee <sarah.goslee at gmail.com>
>>>>>     on Fri, 19 Feb 2016 15:22:22 -0500 writes:

    > Ah, my guess about the confusion was wrong, then. You're
    > misunderstanding silhouette() instead.

    > From ?silhouette:

    >      Observations with a large s(i) (almost 1) are very
    > well clustered, a small s(i) (around 0) means that the
    > observation lies between two clusters, and observations
    > with a negative s(i) are probably placed in the wrong
    > cluster.


    > In more detail, they're looking at different things.
    > clara() assigns each point to a cluster based on the
    > distance to the nearest medoid.

    > silhouette() does something different: instead of
    > comparing the distances to the closest medoid and the next
    > closest medoid, which is what you seem to be assuming,
    > silhouette() compares a(i), the mean distance to ALL other
    > points assigned to the same cluster, with b(i), the mean
    > distance to all points of the nearest other cluster. The
    > distance to the medoid is irrelevant, except as it is one
    > of the points in that cluster.

    > So a negative silhouette value is entirely possible, and
    > means that the clustering produced doesn't represent the
    > dataset very well.

Indeed ... and this extends to pam(), even; as you say above,
 " silhouette() does something different " :

If you look at the plots of

    example(silhouette)

where the silhouettes of   pam(ruspini, k = k')  ,  k' = 2,..,6
are displayed, or if you directly look at

   plot( silhouette(pam(ruspini, k = 6)) )

you will notice that pam() itself can easily lead to negative
silhouette values.
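
To see the mechanism on a small scale, here is a minimal base-R sketch.
The 1-D data and the two medoids are made up by hand for illustration
(they are not taken from ruspini, and sil_by_hand() is a hypothetical
helper, not part of the cluster package).  Every observation is assigned
to its nearest medoid, and s(i) = (b(i) - a(i)) / max(a(i), b(i)) is
then computed directly from the definition:

```r
# Toy 1-D data and two hand-picked medoids (assumptions for illustration).
x       <- c(-20, -2, -1, 0, 1, 2, 6, 13, 13.5, 14)
medoids <- c(0, 13.5)

# Assign every observation to its nearest medoid.
d_med <- abs(outer(x, medoids, "-"))   # n x k distances to the medoids
cl    <- apply(d_med, 1, which.min)    # index of the nearest medoid

# Silhouette width from the definition:
#   a(i) = mean distance to the other members of i's own cluster
#   b(i) = smallest mean distance to the members of any other cluster
D <- as.matrix(dist(x))
sil_by_hand <- function(i) {
  own <- setdiff(which(cl == cl[i]), i)
  a   <- mean(D[i, own])
  b   <- min(sapply(setdiff(unique(cl), cl[i]),
                    function(k) mean(D[i, cl == k])))
  (b - a) / max(a, b)
}
s <- sapply(seq_along(x), sil_by_hand)

i <- which(x == 6)
d_med[i, ]   # 6.0 7.5: the observation IS closer to its own medoid
s[i]         # about -0.196: yet its silhouette width is negative

# Optional cross-check against the cluster package, if it is installed.
if (requireNamespace("cluster", quietly = TRUE)) {
  sil <- cluster::silhouette(cl, dist(x))
  print(all.equal(unname(sil[, "sil_width"]), s))
}
```

The observation at 6 sits at the edge of a stretched-out cluster: its
own medoid is nearer than the other one, but on average the members of
its own cluster are farther away than those of the compact neighbouring
cluster, so a(i) > b(i).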

Martin Maechler  [  == maintainer("cluster")  ]

    

    > On Fri, Feb 19, 2016 at 3:04 PM, ABABAEI, Behnam
    > <Behnam.ABABAEI at limagrain.com> wrote:
    >> Sarah, sorry for taking up your time.
    >> 
    >> I totally agree with you about how it works. But please
    >> let's take a look at this part of the description:
    >> 
    >> "Once k representative objects have been selected from
    >> the sub-dataset, each observation of the entire dataset
    >> is assigned to the nearest medoid. The mean (equivalent
    >> to the sum) of the dissimilarities of the observations to
    >> their closest medoid is used as a measure of the quality
    >> of the clustering. The sub-dataset for which the mean (or
    >> sum) is minimal, is retained. A further analysis is
    >> carried out on the final partition."
    >> 
    >> It says each observation is finally assigned to the
    >> closest medoid. The whole clustering process may be
    >> imperfect in terms of isolation of clusters, but each
    >> observation is already assigned to the closest one and
    >> according to the silhouette formula, the silhouette value
    >> cannot be negative, as a must always be less than b.
    >> 
    >> Regards, Behnam.
    >> 
    >> ________________________________________
    >> From: Sarah Goslee <sarah.goslee at gmail.com>
    >> Sent: 19 February 2016 20:58
    >> To: ABABAEI, Behnam
    >> Cc: r-help at r-project.org
    >> Subject: Re: [R] How a clustering algorithm in R can end
    >> up with negative silhouette values?
    >> 
    >> You need to think more carefully about the details of the
    >> clara() method.
    >> 
    >> The algorithm draws repeated samples of sampsize from the
    >> larger dataset, as specified by the arguments to the
    >> function.  It clusters each sample in turn, and saves the
    >> best one.  It uses the medoids from the best one to
    >> assign all of the points to a cluster.
    >> 
    >> But because the clustering is based on a subsample, it
    >> may not be representative of the dataset as a whole, and
    >> may not provide a good clustering overall. Just because
    >> it clusters the subsample well, doesn't mean it clusters
    >> the entirety. The Details section of the help describes
    >> this, and the book reference goes into more detail.
    >> 
    >> Sarah
    >> 
    >> 
    >> 
    >> On Fri, Feb 19, 2016 at 2:55 PM, ABABAEI, Behnam
    >> <Behnam.ABABAEI at limagrain.com> wrote:
    >>> Hi Sarah,
    >>> 
    >>> Thank you for the response. But it is said in its
    >>> description that after each run (sample), each
    >>> observation in the whole dataset is assigned to the
    >>> closest cluster. So how is it possible for one
    >>> observation to be wrongly allocated, even with clara?
    >>> 
    >>> Behnam
    >>> 
    >>> 
    >>> 
    >>> 
    >>> On Fri, Feb 19, 2016 at 11:48 AM -0800, "Sarah Goslee"
    >>> <sarah.goslee at gmail.com> wrote:
    >>> 
    >>> That means that points have been assigned to the wrong
    >>> groups. This may readily happen with a clustering method
    >>> like cluster::clara() that uses a subset of the data to
    >>> cluster a dataset too large to analyze as a
    >>> unit. Negative silhouette numbers strongly suggest that
    >>> your clustering parameters should be changed.
    >>> 
    >>> Sarah
    >>> 
    >>> On Fri, Feb 19, 2016 at 6:33 AM, ABABAEI, Behnam
    >>> <Behnam.ABABAEI at limagrain.com> wrote:
    >>>> Hi,
    >>>> 
    >>>> 
    >>>> We know that clustering methods in R assign
    >>>> observations to the closest medoids. Hence, each
    >>>> observation is supposed to end up in its closest
    >>>> cluster. So, I wonder how it is possible to have
    >>>> negative silhouette values when we are supposedly
    >>>> assigning each observation to the closest cluster, in
    >>>> which case the silhouette formula cannot become negative?
    >>>> 
    >>>> 
    >>>> Behnam.
    >>>> 

    > ______________________________________________
    > R-help at r-project.org mailing list -- To UNSUBSCRIBE and
    > more, see https://stat.ethz.ch/mailman/listinfo/r-help
    > PLEASE do read the posting guide
    > http://www.R-project.org/posting-guide.html and provide
    > commented, minimal, self-contained, reproducible code.


