[R] pseudo-R2 or GOF for regression trees?

Sat May 5 22:52:25 CEST 2007

Prof. Jeffrey Cardille wrote:
> Hello,
> 
> Is there an accepted way to convey, for regression trees, something  
> akin to R-squared?
> 
> I'm developing regression trees for a continuous y variable and I'd  
> like to say how well they are doing. In particular, I'm analyzing the  
> results of a simulation model having highly non-linear behavior, and  
> asking what characteristics of the inputs are related to a particular  
> output measure.  I've got a very large number of points: n=4000.  I'm  
> not able to do a model sensitivity analysis because of the large  
> number of inputs and the model run time.
> 
> I've been googling around both on the archives and on the rest of the  
> web for several hours, but I'm still having trouble getting a firm  
> sense of the state of the art.  Could someone help me to quickly  
> understand what strategy, if any, is acceptable to say something like  
> "The regression tree in Figure 3 captures 42% of the variance"?  The  
> target audience is readers who will be interested in the subsequent  
> verbal explanation of the relationship, but only once they are  
> comfortable that the tree really does capture something.  I've run  
> across methods to say how well a tree does relative to a set of trees  
> on the same data, but that doesn't help much unless I'm sure the  
> trees in question are really capturing the essence of the system.
> 
> I'm happy to be pointed to a web site or to a thread I may have  
> missed that answers this exact question.
> 
> Thanks very much,
> 
> Jeff
> 
> ------------------------------------------
> Prof. Jeffrey Cardille
> jeffrey.cardille at umontreal.ca
> R-help at stat.math.ethz.ch mailing list

Ye (reference below) gives a method, based on generalized degrees of
freedom, for getting a nearly unbiased estimate of R^2 from recursive
partitioning.  In his examples the result was similar to using the formula
for adjusted R^2 with regression degrees of freedom equal to about 3n/4.
Alternatively, 10-fold cross-validation repeated 20 times gives a fairly
precise and nearly unbiased estimate of R^2.
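
As a rough illustration of the two suggestions above, here is a minimal sketch in plain Python. It is not Ye's procedure and not any package's API: the names adjusted_r2, repeated_cv_r2 and fit_stump are made up for this example, the one-split "stump" merely stands in for whatever tree-fitting routine is actually used, and pooling the out-of-fold squared errors against the total sum of squares is one common way (assumed here) to define a cross-validated R^2.

```python
import random
from statistics import mean


def adjusted_r2(r2, n, df):
    """Adjusted R^2 for an apparent r2, n observations and df regression d.f."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - df - 1)


def repeated_cv_r2(x, y, fit, k=10, repeats=20, seed=0):
    """Average of `repeats` k-fold cross-validated R^2 values.

    `fit(x_train, y_train)` must return a function mapping one x to a prediction.
    """
    rng = random.Random(seed)
    n = len(y)
    y_bar = mean(y)
    sst = sum((yi - y_bar) ** 2 for yi in y)  # total sum of squares
    scores = []
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)  # a fresh random fold assignment for each repeat
        sse = 0.0
        for f in range(k):
            test = idx[f::k]  # every k-th shuffled index forms one fold
            held_out = set(test)
            train = [i for i in idx if i not in held_out]
            predict = fit([x[i] for i in train], [y[i] for i in train])
            # Squared error on observations the model never saw.
            sse += sum((y[i] - predict(x[i])) ** 2 for i in test)
        scores.append(1.0 - sse / sst)
    return mean(scores)


def fit_stump(xs, ys):
    """Hypothetical placeholder learner: a one-split regression tree on one input.

    Assumes xs contains at least two distinct values.
    """
    best = None
    for t in sorted(set(xs))[1:]:  # candidate thresholds; both sides stay non-empty
        left = [yv for xv, yv in zip(xs, ys) if xv < t]
        right = [yv for xv, yv in zip(xs, ys) if xv >= t]
        ml, mr = mean(left), mean(right)
        sse = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    _, t, ml, mr = best
    return lambda xv: ml if xv < t else mr


if __name__ == "__main__":
    rng = random.Random(1)
    xs = [i / 100 for i in range(100)]  # simulated data, for illustration only
    ys = [(1.0 if xv >= 0.5 else 0.0) + rng.gauss(0, 0.3) for xv in xs]
    print("adjusted R^2:", adjusted_r2(0.9, n=4000, df=3000))
    print("repeated-CV R^2:", repeated_cv_r2(xs, ys, fit_stump))
```

With n = 4000 and df = 3n/4 = 3000, the correction factor (n - 1)/(n - df - 1) is 3999/999, about 4, so a purely hypothetical apparent R^2 of 0.9 would shrink to roughly 0.60.  Whether 3n/4 is the right figure for a particular tree is an assumption carried over from Ye's examples; the cross-validated estimate does not rely on it.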

Frank

@ARTICLE{ye98mea,
   author = {Ye, Jianming},
   year = 1998,
   title = {On measuring and correcting the effects of data mining and model
           selection},
   journal = JASA,
   volume = 93,
   pages = {120-131},
   annote = {generalized degrees of freedom;GDF;effective degrees of
            freedom;data mining;model selection;model
            uncertainty;overfitting;nonparametric regression;CART;simulation
            setup}
}
-- 
Frank E Harrell Jr   Professor and Chair           School of Medicine
                      Department of Biostatistics   Vanderbilt University