[R] loop vs. apply(): strange behavior with data frame?

Roberto Perdisci roberto.perdisci at gmail.com
Thu Oct 22 02:17:25 CEST 2009


Hi everybody,
  I noticed a strange behavior when using loops versus apply() on a data frame.
The example below "explicitly" computes a distance matrix for a given
dataset. When the dataset is a matrix, everything works fine. But when
the dataset is a data.frame, the dist.for function, written using
nested loops, takes far longer than the dist.apply function, which
uses apply().
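
For reference, the distance between two rows here is the mean absolute
difference of their entries, which should equal the Manhattan distance
divided by the number of columns. A minimal cross-check against the
built-in dist(), using a small made-up matrix x (not part of the
experiment below):

```r
# Small made-up example matrix: rows are (1,3), (2,5), (4,9).
x <- matrix(c(1, 2, 4,
              3, 5, 9), nrow = 3)

# Explicit mean absolute difference between rows 1 and 2.
explicit <- sum(abs(x[1, ] - x[2, ])) / ncol(x)

# Built-in Manhattan distance, rescaled by the number of columns.
builtin <- as.matrix(dist(x, method = "manhattan"))[1, 2] / ncol(x)

# Both should agree (up to floating point tolerance).
stopifnot(isTRUE(all.equal(explicit, builtin)))
```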

######## USING FOR #######

dist.for <- function(data) {

  d <- matrix(0,nrow=nrow(data),ncol=nrow(data))
  n <- ncol(data)
  r <- nrow(data)

  for(i in 1:r) {
     for(j in 1:r) {
        d[i,j] <- sum(abs(data[i,]-data[j,]))/n
     }
  }

  return(as.dist(d))
}

######## USING APPLY #######

# Distances from one row (data.row) to every row of data.rest.
f <- function(data.row,data.rest) {

  return(as.double(apply(data.rest,1,g,data.row)))

}

g <- function(row2,row1) {
  return(sum(abs(row1-row2))/length(row1))
}

dist.apply <- function(data) {
  d <- apply(data,1,f,data)

  return(as.dist(d))
}


######## TESTING #######

library(mvtnorm)
data <- rmvnorm(100,mean=seq(1,10),sigma=diag(1,nrow=10,ncol=10))

tf <- system.time(df <- dist.for(data))
ta <- system.time(da <- dist.apply(data))

print(paste('diff = ',sum(as.matrix(df) - as.matrix(da))))
print("tf = ")
print(tf)
print("ta = ")
print(ta)

print('----------------------------------')
print('Same experiment on data.frame...')
data2 <- as.data.frame(data)

tf <- system.time(df <- dist.for(data2))
ta <- system.time(da <- dist.apply(data2))

print(paste('diff = ',sum(as.matrix(df) - as.matrix(da))))
print("tf = ")
print(tf)
print("ta = ")
print(ta)

########################

Here is the output I get on my system (R version 2.7.1 on Debian lenny):

[1] "diff =  0"
[1] "tf = "
   user  system elapsed
  0.088   0.000   0.087
[1] "ta = "
   user  system elapsed
  0.128   0.000   0.128
[1] "----------------------------------"
[1] "Same experiment on data.frame..."
[1] "diff =  0"
[1] "tf = "
   user  system elapsed
 35.031   0.000  35.029
[1] "ta = "
   user  system elapsed
  0.184   0.000   0.185
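
One possible check, assuming (this is only a guess) that the extra time
comes from extracting single rows of the data frame with data[i,], is to
time row extraction alone on a matrix and on the equivalent data frame.
The names m, dfr, t.matrix and t.frame and the sizes are made up for this
sketch:

```r
# Toy data, independent of the experiment above (sizes are arbitrary).
m   <- matrix(rnorm(1000), nrow = 100)
dfr <- as.data.frame(m)

# Time 1000 extractions of the first row from each object.
t.matrix <- system.time(for (k in 1:1000) m[1, ])
t.frame  <- system.time(for (k in 1:1000) dfr[1, ])

print(t.matrix)
print(t.frame)

# The results also differ in type: a numeric vector for the matrix,
# a one-row data frame for the data frame.
stopifnot(is.numeric(m[1, ]), is.data.frame(dfr[1, ]))
```

If the guess is right, converting once with as.matrix(data2) before
calling dist.for should bring the loop timing back in line with the
matrix case.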

Could you explain why that happens?

Thank you,
regards

Roberto



