[R] stats 'dist' euclidean distance calculation
Cheyenne Yancey Payne
cypayne at stanford.edu
Thu Mar 15 01:11:43 CET 2018
Hello,
I am working with a matrix of multilocus genotypes for ~180 individual snail samples, with substantial missing data. I am trying to calculate the pairwise genetic distance between individuals using the stats package 'dist' function, using euclidean distance. I took a subset of this dataset (3 samples x 3 loci) to test how euclidean distance is calculated:
3x3 subset used
Locus1 Locus2 Locus3
Samp1 GG <NA> GG
Samp2 AG CA GA
Samp3 AG CA GG
The euclidean distance function is defined as: sqrt(sum((x_i - y_i)^2))
My assumption was that the difference between x_i and y_i would be the number of allelic differences at each base pair site between samples. For example, the euclidean distance between Samp1 and Samp2 would be:
euclidean distance = sqrt( S1_L1 - S2_L1)^2 + (S1_L2 - S2_L2)^2 + (S1_L3 - S2_L3)^2 )
at locus 1: GG - AG --> one basepair difference --> (1)^2 = 1
at locus 2: <NA> - CA --> two basepair differences --> (2)^2 = 4
at locus 3: GG - GA --> one basepair difference --> (1)^2 = 1
euclidean distance = sqrt( 1 + 4 + 1 ) = sqrt( 6 ) = 2.44940
Calculating euclidean distances this way, the distance matrix should be:
# Samp1 Samp2 Samp3
# Samp1 0.000000 2.449400 2.236068
# Samp2 2.449400 0.000000 1.000000
# Samp3 2.236068 1.000000 0.000000
However, this distance matrix differs from the one calculated by the R stats package 'dist' function:
# Samp1 Samp2 Samp3
# Samp1 0.000000 3.478652 2.659285
# Samp2 3.478652 0.000000 2.124787
# Samp3 2.659285 2.124787 0.000000
I used the following code (with intermediate output) to achieve the latter distance matrix:
>>>
setwd("~/Desktop/R_stuff")
### Data Prep: load and collect info from matrix file
infile<-'~/Desktop/R_stuff/good_conch_mplex_03052018.txt'
Mydata <- read.table(infile, header = TRUE, check.names = FALSE)
dim(Mydata) # dimensions of data.frame
ind <- as.character(Mydata$sample) # individual labels
population <- as.character(Mydata$region) # population labels
location <- Mydata$location
### Section 1: Convert data to usable format
# removes non-genotype data from matrix (i.e. lines 1-4)
# subset 3 samples, 3 loci for testing
SAMPS<-3
locus <- Mydata[c(1:SAMPS), -c(1, 2, 3, 4, 5+SAMPS:ncol(Mydata))]
locus
# Locus1 Locus2 Locus3
# Samp1 GG <NA> GG
# Samp2 AG CA GA
# Samp3 AG CA GG
# converts geno matrix to genind object (adegenet)
Mydata1 <- df2genind(locus, ploidy = 2, ind.names = ind[1:SAMPS], pop = population[1:SAMPS], sep="")
Mydata1$tab # get stats on newly created genind object
# Locus1.G Locus1.A Locus2.C Locus2.A Locus3.G Locus3.A
# Samp1 2 0 NA NA 2 0
# Samp2 1 1 1 1 1 1
# Samp3 1 1 1 1 2 0
### Section 2: Individual genetic distance: euclidean distance (dist {adegenet})
# Test dist() function
# collect euclidean genetic distances
distgenEUCL <- dist(Mydata1, method = "euclidean", diag = FALSE, upper = FALSE, p = 2)
distgenEUCL
# Samp1 Samp2
# Samp2 2.449490
# Samp3 1.732051 1.414214
x<-as.matrix(dist(distgenEUCL))
x
# Samp1 Samp2 Samp3
# Samp1 0.000000 3.478652 2.659285
# Samp2 3.478652 0.000000 2.124787
# Samp3 2.659285 2.124787 0.000000
>>>
I have checked several forums and have been unable to resolve this discrepancy. Any and all help will be much appreciated!
Thank you!
Cheyenne
Cheyenne Payne
Bioinformatics Technician
Palumbi Lab | Hopkins Marine Station
(714) 200-5897
[[alternative HTML version deleted]]
More information about the R-help
mailing list