[R] statistical significance test for cluster agreement

Thu Mar 25 04:00:45 CET 2004

> From: Alexander Sirotkin [at Yahoo] [mailto:alex_s_42 at yahoo.com] 
> 
> Christian,
> 
> I think I understand your point, but I do not
> completely agree with you. I also did not describe 
> my problem clearly enough.
> 
> > If you see two clusterings on the same data, they are
> > identical, if they are 100% identical, and if not, then not.
> 
> What you are actually saying is that all values of 
> the Rand index for cluster agreement other than 1 
> indicate that the clusters do not agree. I believe
> that many people would disagree with this statement.
> 
> Let me explain my problem in a little bit more detail.
> 
> I have a classified data set. These classes were 
> obtained using non-statistical methods. What I'm
> trying to do is run some clustering algorithm and
> compare its results to this known classification.
> 
> I think that this is not very different from
> calculating mean and comparing it to some known value.

AFAICS they are most definitely not the same.  The hypotheses in statistical
tests are about the `true', unknown population mean, not the sample mean
observed in the data.  What exactly would be the hypotheses you intend to
test?  If you are testing whether the clustering algorithm produces
something that disagrees with the non-statistical classification, then one
disagreement would settle it, no?  Before you think about what
statistic to use, do try to figure out how you would write the null and
alternative hypotheses, mathematically.
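
One possible formulation, purely as an illustrative assumption: take H0 to
be that the cluster labels are unrelated to the known classes, so that every
reassignment of the cluster labels to the observations is equally likely,
and H1 to be that the two partitions agree more often than that.  Under such
an H0 the Rand index has a permutation distribution that can be simulated.
The Python sketch below uses only the standard library; the function names,
the toy labels and the number of permutations are all hypothetical.

```python
import random
from itertools import combinations


def rand_index(a, b):
    """Fraction of observation pairs on which two labelings agree.

    A pair agrees if it is in the same group in both labelings,
    or in different groups in both labelings.
    """
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)


def permutation_p_value(known, clustered, n_perm=1000, seed=0):
    """One-sided permutation p-value for the observed Rand index.

    Assumed H0: cluster labels are exchangeable with respect to the
    known classes (no association between the two partitions).
    """
    rng = random.Random(seed)
    observed = rand_index(known, clustered)
    shuffled = list(clustered)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any link to the known classes
        if rand_index(known, shuffled) >= observed:
            hits += 1
    # Add one to numerator and denominator so the p-value is never zero.
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical toy data: 30 observations, 3 known classes,
# and a clustering that gets 3 observations "wrong".
known = [0] * 10 + [1] * 10 + [2] * 10
clustered = list(known)
clustered[0], clustered[10], clustered[20] = 1, 2, 0

observed, p_value = permutation_p_value(known, clustered)
print("Rand index:", observed, "permutation p-value:", p_value)
```

Note that this addresses agreement beyond chance, which is a different
question from whether the two partitions are identical.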

Andy

> I think that it should be theoretically possible to
> use the Rand index as a test statistic. 
> 
> Or maybe I'm missing something...
> 