[R] stats 'dist' euclidean distance calculation
S Ellison
S.Ellison at LGCGroup.com
Thu Mar 15 13:11:18 CET 2018
> 3x3 subset used
> Locus1 Locus2 Locus3
> Samp1 GG <NA> GG
> Samp2 AG CA GA
> Samp3 AG CA GG
>
> The euclidean distance function is defined as: sqrt(sum((x_i - y_i)^2)) My
> assumption was that the difference between x_i and y_i would be the number
> of allelic differences at each base pair site between samples.
Base R does not share your assumption, which (from a general-purpose stats point of view) would be a completely outlandish interpretation of the data. As far as base R is concerned, these are just arbitrary character strings, represented (by default) as factors. Factors are stored internally as integers, assigned (by default) in increasing lexical order to the levels present. So if you apply dist() to factors constructed from allele data, you will usually get complete nonsense in genetic terms.
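To make the point about integer codes concrete, here is a minimal sketch using the Locus1 calls from the quoted subset. It assumes the factor has been reduced to its integer codes (for example via as.integer() or data.matrix()) before reaching dist(); the variable names are only illustrative.

```r
# Locus1 calls for Samp1..Samp3, stored as a factor.
locus1 <- factor(c("GG", "AG", "AG"))

# Levels are sorted lexically, so "AG" gets code 1 and "GG" gets code 2.
levels(locus1)

# The internal integer codes, one per sample.
codes <- as.integer(locus1)
codes

# Euclidean distance on the codes: "GG" vs "AG" comes out as |2 - 1| = 1
# only because of the alphabetical ordering of the labels, not because
# the two genotypes differ by one allele.
d <- dist(codes)
d
```

With more than two levels the codes are simply 1, 2, 3, ... in alphabetical order, so the resulting distances reflect label order rather than any genetic relationship.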
You should probably look at something like dist.gene in the ape package: see
https://www.rdocumentation.org/packages/ape/versions/5.0/topics/dist.gene
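For comparison, the allele-counting difference described in the question could be sketched in base R roughly as follows. This is only an illustration of that assumption, not what dist.gene computes (check its help page for the exact definition). The helper allele_diff() is hypothetical, genotypes are treated as unordered pairs of single-letter alleles, and loci with a missing call are simply dropped from the sum without any rescaling.

```r
# Hypothetical helper: number of allelic differences between two genotype
# calls such as "AG" and "GG", treating each call as an unordered pair.
allele_diff <- function(a, b) {
  if (is.na(a) || is.na(b)) return(NA_integer_)
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  # Remove from y each allele that can be matched to one in x;
  # whatever is left over counts as a difference.
  for (al in x) {
    i <- match(al, y)
    if (!is.na(i)) y <- y[-i]
  }
  length(y)
}

# The 3x3 subset from the question, kept as a character matrix.
geno <- matrix(c("GG", NA,   "GG",
                 "AG", "CA", "GA",
                 "AG", "CA", "GG"),
               nrow = 3, byrow = TRUE,
               dimnames = list(c("Samp1", "Samp2", "Samp3"),
                               c("Locus1", "Locus2", "Locus3")))

# Pairwise Euclidean distance over per-locus allelic differences.
n <- nrow(geno)
d <- matrix(0, n, n, dimnames = list(rownames(geno), rownames(geno)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    diffs <- mapply(allele_diff, geno[i, ], geno[j, ])
    # Loci with a missing call are ignored (no rescaling applied).
    d[i, j] <- sqrt(sum(diffs^2, na.rm = TRUE))
  }
}
as.dist(d)
```

Whether "AG" and "GA" should count as identical, and how missing calls should be weighted, are genetic modelling choices; a dedicated package is likely to handle them more carefully than this sketch.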
S Ellison