[R] algorithm for clustering categorical data

Tue Aug 6 20:42:21 CEST 2013

Thanks David, this is very useful!

-----Original Message-----
From: David Carlson [mailto:dcarlson at tamu.edu] 
Sent: Tuesday, August 06, 2013 11:27 AM
To: Li, Yan; r-help at r-project.org
Subject: RE: [R] algorithm for clustering categorical data

What do you mean by representing the categorical fields by 1:k?

a <- c("red", "green", "blue", "orange", "yellow")

becomes

a <- c(1, 2, 3, 4, 5)

That guarantees your results are worthless unless your categories have an inherent order (e.g. tiny, small, medium, big, giant).
Otherwise the variable should be coded as four (k-1) indicator/dummy variables, e.g.:

a.red <- c(1, 0, 0, 0, 0)
a.green <- c(0, 1, 0, 0, 0)
a.blue <- c(0, 0, 1, 0, 0)
a.orange <- c(0, 0, 0, 1, 0)

Then you can use Euclidean distance.
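
To make the difference concrete, here is a small sketch (not part of the original reply) comparing Euclidean distances under the 1:k coding and under indicator coding. The object names a.int, ind, d.int, d.k1 and d.k are made up for illustration. One thing the sketch shows: with k-1 indicators the omitted category ("yellow") sits at distance 1 from every other category while the others are sqrt(2) apart, so keeping all k indicators may be preferable when every pair of categories should be equally distant.

```r
a <- c("red", "green", "blue", "orange", "yellow")

# Integer coding 1:k: distances depend on the arbitrary order of the labels
a.int <- setNames(seq_along(a), a)
d.int <- dist(a.int)

# One 0/1 indicator column per category (rows = cases, columns = categories)
ind <- sapply(a, function(level) as.numeric(a == level))
rownames(ind) <- a

d.k1 <- dist(ind[, 1:4])  # k-1 indicators, "yellow" omitted
d.k  <- dist(ind)         # all k indicators

round(as.matrix(d.int), 2)  # red-yellow = 4, red-green = 1
round(as.matrix(d.k1), 2)   # red-green = 1.41, red-yellow = 1
round(as.matrix(d.k), 2)    # every pair = 1.41
```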

-------------------------------------
David L Carlson
Associate Professor of Anthropology
Texas A&M University
College Station, TX 77840-4352

-----Original Message-----
From: Li, Yan [mailto:Yan_Li at ibi.com]
Sent: Tuesday, August 6, 2013 9:36 AM
To: dcarlson at tamu.edu; r-help at r-project.org
Subject: RE: [R] algorithm for clustering categorical data

Hi David and other R helpers,

If I rescale the numeric fields to [0,1] and code the categorical fields as 1:k, which is the same starting point as Gower's measure, but then use Euclidean distance instead of Gower's distance for k-means clustering, how much difference does it make? What is the drawback?

Thank you,
Yan

-----Original Message-----
From: David Carlson [mailto:dcarlson at tamu.edu]
Sent: Thursday, August 01, 2013 12:08 PM
To: Li, Yan; r-help at r-project.org
Subject: RE: [R] algorithm for clustering categorical data

Read up on Gower's distance measure (available in the ecodist package), which can combine numeric and categorical data. You didn't give us any information about how you numerically transformed the categorical variables, but the usual approach is to create indicator variables that code presence/absence for each category within a categorical variable. Differences in variance between variables can be reduced by standardizing the variables.
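
As a rough illustration of Gower's distance on mixed data (not part of the original reply): the sketch below uses daisy() from the cluster package rather than ecodist, which is a substitution on my part, and the data frame d with columns income and colour is invented toy data. With metric = "gower", numeric columns are scaled by their range and factor columns contribute 0 for a match and 1 for a mismatch.

```r
library(cluster)  # recommended package, normally installed with R

# Hypothetical toy data: one numeric and one categorical variable
d <- data.frame(
  income = c(10000, 12000, 50000, 52000),
  colour = factor(c("red", "red", "blue", "green"))
)

# Gower dissimilarity: average of the per-variable contributions
g <- daisy(d, metric = "gower")
round(as.matrix(g), 3)

# The dissimilarity object can be passed to a clustering routine,
# e.g. partitioning around medoids with an assumed k = 2
fit <- pam(g, k = 2)
fit$clustering
```

Rows 1 and 2 share a colour and differ by 2000 out of an income range of 42000, so their dissimilarity is (2000/42000 + 0)/2; rows 3 and 4 have the same income gap but different colours, giving (2000/42000 + 1)/2.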


-----Original Message-----
From: r-help-bounces at r-project.org
[mailto:r-help-bounces at r-project.org] On Behalf Of Li, Yan
Sent: Thursday, August 1, 2013 11:00 AM
To: r-help at r-project.org
Subject: [R] algorithm for clustering categorical data

Hi All,

Does anyone know of an algorithm for clustering categorical variables, and which R packages implement it? Which is the best?

If a data set has both numeric and categorical variables, what is the best clustering algorithm and R package to use?

I tried transforming all categorical fields to numbers and clustering afterwards. But the transformed fields take values from 1 to 10, while my other fields are on a much larger scale (10000 and up). This gives the categorical fields less effect on the distance calculation.
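
The scale problem described above can be reproduced with a small sketch (added for illustration; the data frame x, its columns amount and group, and the helper rescale01 are all hypothetical). It only shows how the larger-scale field dominates Euclidean distance and how range rescaling to [0,1] evens this out; it does not address whether numeric codes are a sensible representation of the categories.

```r
# Hypothetical data: a large-scale numeric field and a 1..10 coded field
x <- data.frame(amount = c(10000, 10500, 90000),
                group  = c(1, 10, 1))

# Raw Euclidean distances are driven almost entirely by 'amount'
dist(x)

# Rescale each column to [0, 1] by its range
rescale01 <- function(v) (v - min(v)) / (max(v) - min(v))
x01 <- as.data.frame(lapply(x, rescale01))

# After rescaling, both fields contribute on a comparable scale
dist(x01)
```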

Thank you!
Yan

