[R] modification of cross-validations in rpart

Petr Savicky savicky at praha1.ff.cuni.cz
Tue Jul 5 11:20:57 CEST 2011


On Mon, Jul 04, 2011 at 09:22:23AM -0400, Katerine Goyer wrote:
> 
> Hello, 
> 
> I am using the rpart function (from the rpart package) to do a regression
> tree that would describe the behaviour of a fish species according to several
> environmental variables. For each fish (sampling unit), I have repeated
> observations of the response variable, which means that the data are not
> independent. Normally, in this case, V-fold cross-validation needs to be
> modified to prevent over-optimistic predictions of error rates by
> cross-validation and overestimation of the tree size. A way to overcome this
> problem is by selecting only whole sampling units in our subsets of
> cross-validation. My problem is that I don't know how to perform this
> modification of the cross-validation process in the rpart function.
> 
> Is there a way to do this modification in rpart or is there any other
> function I could use that would consider interdependence in the response
> variable?
> 
> Here is an example of the code I am using ("Y" being the response variable
> and "data.env" being a data frame of the environmental variables):
> 
> Tree = rpart(Y ~ X1 + X2 + X3,xval=100,data=data.env)
> 

Hello.

You may need to program the cross-validation at the R level.
One option is package tree, whose tree() function, unlike rpart(),
does not run a cross-validation internally. An example follows:

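  library(tree)

  # Simulated data standing in for the real measurements
  set.seed(1)  # only to make the example reproducible
  X1 <- rnorm(200)
  X2 <- rnorm(200)
  X3 <- rnorm(200)
  Y <- ifelse(X1 > 0, X2, X3)
  data.env <- data.frame(X1, X2, X3, Y)

  # ind assigns each row to a sampling unit (fish); length(ind) == nrow(data.env)
  ind <- rep(1:7, times=c(20, 30, 35, 30, 30, 25, 30))
  pred <- rep(NA_real_, times=nrow(data.env))

  # Hold out one whole fish at a time, fit on the rest, predict the held-out fish
  for (i in unique(ind)) {
      Tree <- tree(Y ~ X1 + X2 + X3, data=data.env[ind != i, ])
      PrunedTree <- prune.tree(Tree, best = 10)
      pred[ind == i] <- predict(PrunedTree, newdata=data.env[ind == i, ])
  }

  # Cross-validated predictions against the observed values
  plot(data.env$Y, pred, asp=1)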
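
The loop above fixes the tree size at 10 leaves. Since the original concern was overestimation of the tree size, the same grouped loop can be repeated over several candidate sizes and the cross-validated error compared. The following is only a sketch under assumptions: the range of sizes (2 to 10) and the use of mean squared error are arbitrary choices made for illustration, and the data are simulated as above.

```r
library(tree)

set.seed(1)
n <- 200
data.env <- data.frame(X1 = rnorm(n), X2 = rnorm(n), X3 = rnorm(n))
data.env$Y <- ifelse(data.env$X1 > 0, data.env$X2, data.env$X3)
ind <- rep(1:7, times = c(20, 30, 35, 30, 30, 25, 30))  # one value per fish

sizes <- 2:10                    # candidate numbers of leaves (assumed range)
sse <- numeric(length(sizes))    # accumulated squared error for each size

for (i in unique(ind)) {
    train <- data.env[ind != i, ]
    test  <- data.env[ind == i, ]
    Tree <- tree(Y ~ X1 + X2 + X3, data = train)
    for (j in seq_along(sizes)) {
        # prune.tree() warns if 'best' exceeds the size of the fitted tree
        PrunedTree <- prune.tree(Tree, best = sizes[j])
        p <- predict(PrunedTree, newdata = test)
        sse[j] <- sse[j] + sum((test$Y - p)^2)
    }
}

cv.mse <- sse / nrow(data.env)   # grouped cross-validated mean squared error
best.size <- sizes[which.min(cv.mse)]
```

Plotting cv.mse against sizes shows where additional leaves stop reducing the error on held-out fish.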

The vector ind should be prepared so that all observations of
the same fish have the same value. See ?tree and ?prune.tree
for further parameters.
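
One possible way to build such a vector from a fish identifier is sketched below. The column name fish, the number of fish and the number of groups V are hypothetical and chosen only for illustration. The last lines rely on the behaviour described in ?xpred.rpart, where xval may be an explicit vector of integers defining the cross-validation groups; whether this suits the real data should be checked against that help page.

```r
library(rpart)

set.seed(1)
# Hypothetical data: 12 fish with 5 to 15 observations each
fish <- rep(paste0("fish", 1:12), times = sample(5:15, 12, replace = TRUE))
n <- length(fish)
data.env <- data.frame(X1 = rnorm(n), X2 = rnorm(n), X3 = rnorm(n))
data.env$Y <- ifelse(data.env$X1 > 0, data.env$X2, data.env$X3)

# Leave-one-fish-out: one group per fish
ind.lofo <- as.integer(factor(fish))

# V groups of whole fish: assign each fish (not each row) to a group
V <- 4
ids <- unique(fish)
group.of.fish <- sample(rep(seq_len(V), length.out = length(ids)))
names(group.of.fish) <- ids
ind <- unname(group.of.fish[fish])

# Every fish must fall into exactly one group
stopifnot(all(tapply(ind, fish, function(z) length(unique(z))) == 1))

# Cross-validated predictions from rpart using these groups;
# the result has one column per value of the complexity parameter cp
fit <- rpart(Y ~ X1 + X2 + X3, data = data.env)
xp <- xpred.rpart(fit, xval = ind)
cv.mse <- colMeans((xp - data.env$Y)^2)
```

Either ind.lofo or ind can be used in place of ind in the loop above; the first holds out one fish at a time, the second holds out several fish together.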

Consider also the randomForest package, which may be more accurate,
although it does not provide a comprehensible model.
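
A random forest can be evaluated with the same grouped loop. The sketch below assumes the randomForest package is installed and uses its default settings with simulated data. The forest's built-in out-of-bag error resamples rows rather than whole fish, so it has the same dependence problem as ordinary cross-validation, which is why whole fish are held out here.

```r
library(randomForest)

set.seed(1)
n <- 200
data.env <- data.frame(X1 = rnorm(n), X2 = rnorm(n), X3 = rnorm(n))
data.env$Y <- ifelse(data.env$X1 > 0, data.env$X2, data.env$X3)
ind <- rep(1:7, times = c(20, 30, 35, 30, 30, 25, 30))  # one value per fish

pred <- rep(NA_real_, nrow(data.env))
for (i in unique(ind)) {
    # Fit on all fish except i, then predict the held-out fish
    Forest <- randomForest(Y ~ X1 + X2 + X3, data = data.env[ind != i, ])
    pred[ind == i] <- predict(Forest, newdata = data.env[ind == i, ])
}

rf.mse <- mean((data.env$Y - pred)^2)  # grouped cross-validated error
```

Comparing rf.mse with the error of the pruned tree gives an idea of how much accuracy is traded for an interpretable model.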

Hope this helps.

Petr Savicky.


