[R] How a clustering algorithm in R can end up with negative silhouette values?
Martin Maechler
maechler at stat.math.ethz.ch
Mon Feb 22 16:48:39 CET 2016
>>>>> Sarah Goslee <sarah.goslee at gmail.com>
>>>>> on Fri, 19 Feb 2016 15:22:22 -0500 writes:
> Ah, my guess about the confusion was wrong, then. You're
> misunderstanding silhouette() instead.
> From ?silhouette:
> Observations with a large s(i) (almost 1) are very
> well clustered, a small s(i) (around 0) means that the
> observation lies between two clusters, and observations
> with a negative s(i) are probably placed in the wrong
> cluster.
> In more detail, they're looking at different things.
> clara() assigns each point to a cluster based on the
> distance to the nearest medoid.
> silhouette() does something different: instead of
> comparing the distances to the closest medoid and the next
> closest medoid, which is what you seem to be assuming,
> silhouette() looks at the mean distance to ALL other
> points assigned to that cluster, vs the mean distance to
> all points in the nearest other cluster. The distance to the medoid
> is irrelevant, except as it is one of the points in that
> cluster.
> So a negative silhouette value is entirely possible, and
> means that the cluster produced doesn't represent the
> dataset very well.
Indeed ... and this extends to pam(), even; as you say above,
" silhouette() does something different " :
If you look at the plots of
example(silhouette)
where the silhouettes of pam(ruspini, k = k') , k' = 2,..,6
are displayed, or if you directly look at
plot( silhouette(pam(ruspini, k = 6)) )
you will notice that pam() itself can easily lead to negative
silhouette values.
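
As a rough illustration (a minimal sketch, not code from the cluster
package): the helper sil_by_hand() below is a hypothetical name that
recomputes s(i) directly from the definition, with a(i) the mean
dissimilarity to the other members of the observation's own cluster and
b(i) the smallest mean dissimilarity to the members of any other cluster.
It assumes Euclidean distances on ruspini, which is pam()'s default
metric, and compares the result with silhouette() from cluster. Neither
a(i) nor b(i) involves the medoids, which is why nearest-medoid
assignment does not force a(i) < b(i).

```r
library(cluster)  # recommended package; provides pam(), silhouette(), ruspini

## Hypothetical helper (not part of 'cluster'): silhouette widths by hand.
##   a(i) = mean dissimilarity of i to the OTHER members of its own cluster
##   b(i) = smallest mean dissimilarity of i to the members of another cluster
##   s(i) = (b(i) - a(i)) / max(a(i), b(i)); s(i) = 0 for singleton clusters
sil_by_hand <- function(d, cl) {
  d   <- as.matrix(d)
  ids <- sort(unique(cl))
  s   <- numeric(nrow(d))
  for (i in seq_len(nrow(d))) {
    own <- setdiff(which(cl == cl[i]), i)
    if (length(own) == 0L) next            # singleton cluster: s(i) stays 0
    a <- mean(d[i, own])
    b <- min(vapply(setdiff(ids, cl[i]),
                    function(k) mean(d[i, cl == k]), numeric(1)))
    s[i] <- (b - a) / max(a, b)
  }
  s
}

d  <- dist(ruspini)                        # Euclidean, as pam() uses by default
p6 <- pam(ruspini, k = 6)

s_hand <- sil_by_hand(d, p6$clustering)
s_pkg  <- silhouette(p6$clustering, d)[, "sil_width"]  # default method, data order

## Is every observation nearest to its own medoid?
dm      <- as.matrix(d)
nearest <- apply(dm[, p6$id.med], 1, which.min)
print(table(nearest_medoid_is_own = nearest == p6$clustering))

## Silhouette widths, and any observations with s(i) < 0
print(summary(s_hand))
print(which(s_hand < 0))
```

The same comparison can be run on the clustering vector of a clara()
fit, since silhouette()'s default method only needs the integer cluster
codes and a dissimilarity object.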
Martin Maechler [ == maintainer("cluster") ]
> On Fri, Feb 19, 2016 at 3:04 PM, ABABAEI, Behnam
> <Behnam.ABABAEI at limagrain.com> wrote:
>> Sarah, sorry for taking up your time.
>>
>> I totally agree with you about how it works. But please
>> let's take a look at this part of the description:
>>
>> "Once k representative objects have been selected from
>> the sub-dataset, each observation of the entire dataset
>> is assigned to the nearest medoid. The mean (equivalent
>> to the sum) of the dissimilarities of the observations to
>> their closest medoid is used as a measure of the quality
>> of the clustering. The sub-dataset for which the mean (or
>> sum) is minimal, is retained. A further analysis is
>> carried out on the final partition."
>>
>> It says each observation is finally assigned to the
>> closest medoid. The whole clustering process may be
>> imperfect in terms of isolation of clusters, but each
>> observation is already assigned to the closest one and
>> according to the silhouette formula, the silhouette value
>> cannot be negative, as a must always be less than b.
>>
>> Regards, Behnam.
>>
>> ________________________________________
>> From: Sarah Goslee <sarah.goslee at gmail.com>
>> Sent: 19 February 2016 20:58
>> To: ABABAEI, Behnam
>> Cc: r-help at r-project.org
>> Subject: Re: [R] How a clustering algorithm in R can end
>> up with negative silhouette values?
>>
>> You need to think more carefully about the details of the
>> clara() method.
>>
>> The algorithm draws repeated samples of sampsize from the
>> larger dataset, as specified by the arguments to the
>> function. It clusters each sample in turn, and saves the
>> best one. It uses the medoids from the best one to
>> assign all of the points to a cluster.
>>
>> But because the clustering is based on a subsample, it
>> may not be representative of the dataset as a whole, and
>> may not provide a good clustering overall. Just because
>> it clusters the subsample well, doesn't mean it clusters
>> the entirety. The details section of the help describes
>> this, and the book reference goes into more detail.
>>
>> Sarah
>>
>>
>>
>> On Fri, Feb 19, 2016 at 2:55 PM, ABABAEI, Behnam
>> <Behnam.ABABAEI at limagrain.com> wrote:
>>> Hi Sarah,
>>>
>>> Thank you for the response. But it is said in its
>>> description that after each run (sample), each
>>> observation in the whole dataset is assigned to the
>>> closest cluster. So how is it possible for one
>>> observation to be wrongly allocated, even with clara?
>>>
>>> Behnam
>>>
>>>
>>>
>>>
>>> On Fri, Feb 19, 2016 at 11:48 AM -0800, "Sarah Goslee"
>>> <sarah.goslee at gmail.com> wrote:
>>>
>>> That means that points have been assigned to the wrong
>>> groups. This may readily happen with a clustering method
>>> like cluster::clara() that uses a subset of the data to
>>> cluster a dataset too large to analyze as a
>>> unit. Negative silhouette numbers strongly suggest that
>>> your clustering parameters should be changed.
>>>
>>> Sarah
>>>
>>> On Fri, Feb 19, 2016 at 6:33 AM, ABABAEI, Behnam
>>> <Behnam.ABABAEI at limagrain.com> wrote:
>>>> Hi,
>>>>
>>>>
>>>> We know that clustering methods in R assign
>>>> observations to the closest medoids. Hence, it is
>>>> supposed to be the closest cluster each observation can
>>>> have. So, I wonder how it is possible to have negative
>>>> silhouette values, while we supposedly assign
>>>> each observation to the closest cluster and the formula
>>>> in the silhouette method cannot become negative?
>>>>
>>>>
>>>> Behnam.
>>>>