[R] standardization of values before call to pam() or clara()

Sat Jun 3 14:19:39 CEST 2006

>>>>> "Dylan" == Dylan Beaudette <dylan.beaudette at gmail.com>
>>>>>     on Mon, 22 May 2006 17:33:47 -0700 writes:

    Dylan> Greetings. I am experimenting with the cluster package
    Dylan> and am starting to scratch my head regarding the
    Dylan> *best* way to standardize my data. Both functions can
    Dylan> pre-standardize columns in a data frame. According to
    Dylan> the manual:

    Dylan> Measurements are standardized for each variable
    Dylan> (column), by subtracting the variable's mean value
    Dylan> and dividing by the variable's mean absolute
    Dylan> deviation.

    Dylan> This works well when input variables are all in the
    Dylan> same units. When I include new variables with a
    Dylan> different intrinsic range, the ones with the largest
    Dylan> relative values tend to be _weighted_ more heavily.
    Dylan> This is certainly not surprising, but it complicates
    Dylan> things.

    Dylan> Does there exist a robust technique to effectively
    Dylan> re-scale each of the variables, regardless of their
    Dylan> intrinsic range, to some set range, say [0,1]?

    Dylan> I have tried dividing a variable by the maximum value
    Dylan> of that variable, but I am not sure if this is
    Dylan> statistically correct.

A more usual scaling / standardization is accomplished by the
function -- guess what? -- scale()

It defaults to standardizing each column to mean 0 and standard
deviation 1, but you can use it as well to do a [0,1] scaling.
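
As a minimal sketch of both uses of scale(): the toy matrix and its
column names below are made up for illustration, and the [0,1]
variant assumes no column is constant (otherwise the range is 0 and
the division fails).

```r
# Hypothetical toy data: two variables on very different scales.
x <- cbind(depth_cm = c(10, 25, 40, 80),
           clay_pct = c(5, 12, 30, 18))

# Default: each column is centered to mean 0 and scaled to sd 1.
z <- scale(x)

# [0,1] scaling: subtract the column minima, divide by the column ranges.
mins <- apply(x, 2, min)
rngs <- apply(x, 2, max) - mins
x01  <- scale(x, center = mins, scale = rngs)
```

After this, every column of x01 has minimum 0 and maximum 1, so no
variable dominates merely because of its units.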

Note that you are very wise to think about the importance of
variable scaling / weighting for cluster analysis.
But people have been "here" before, and invented the much more
general notion of a distance/dissimilarity between observational
units.
--> the functions daisy() {in "cluster"} and dist() {from "stats"}
provide such dissimilarity objects.
These can be used as input for pam() as well {clara(), in
contrast, needs the data matrix itself}, and in constructing them
you are much more flexible than trying to find a proper scaling
of your x-matrix.
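
A small sketch of the dissimilarity route, using pam() only. The
data are invented for illustration; k = 2 is an assumption that
happens to suit this toy example.

```r
library(cluster)

# Hypothetical toy data with two obvious groups.
x <- cbind(depth_cm = c(10, 12, 15, 70, 75, 80),
           clay_pct = c(5, 7, 6, 30, 28, 33))

# Euclidean distances on columns standardized by scale().
d1 <- dist(scale(x))

# daisy() can standardize itself (mean and mean absolute deviation).
d2 <- daisy(x, metric = "euclidean", stand = TRUE)

# pam() recognizes dist / dissimilarity objects and clusters on them.
fit <- pam(d1, k = 2)
fit$clustering
```

Because the dissimilarities are built first, the choice of scaling
or metric can be changed without touching the call to pam().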

Note that daisy() in particular has been designed for computing
sensible dissimilarities for the case when the x-matrix has a
mix of continuous {e.g. "interval scaled"} and categorical
{e.g. binary} variables.
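
For the mixed case, a hedged sketch with an invented data frame
{the variable names are hypothetical}. With metric = "gower",
daisy() scales each numeric variable by its range, so every
variable contributes a value in [0,1] to the dissimilarity.

```r
library(cluster)

# Hypothetical data frame mixing numeric and categorical columns.
soil <- data.frame(
  depth_cm = c(10, 12, 70, 75),
  clay_pct = c(5, 7, 30, 28),
  drained  = factor(c("yes", "yes", "no", "no"))
)

# Gower dissimilarity handles the numeric and factor columns together.
d <- daisy(soil, metric = "gower")

# The dissimilarity object goes straight into pam().
fit <- pam(d, k = 2)
fit$clustering
```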

I recommend you get a textbook on clustering, to read up more on
the subject.

Regards, 
Martin Maechler, ETH Zurich

    Dylan> Any ideas, thoughts would be greatly appreciated.

    Dylan> Cheers,

    Dylan> -- Dylan Beaudette Soils and Biogeochemistry Graduate
    Dylan> Group University of California at Davis 530.754.7341