[R] Variable Importance in pls: R or B? (and in glpls?)

Berton Gunter gunter.berton at gene.com
Mon Sep 13 18:13:34 CEST 2004


I noted that there were not a great number of people leaping to reply. One
reason, I suspect, is that there's really NO GOOD ANSWER to this question.
First, there is a huge literature on this: it is related to variable
selection and shrinkage estimation in regression and, more generally, to
parsimonious model building. Second, as Ron Wehrens already noted, when
variables are correlated (which could have as much to do with the vagaries
of the sampling as with real physical causality), the whole notion of
"variable importance" is problematic. Fact is, **any** attempt to rank the
contributions of particular variables to PLS components from undesigned data
(the usual case) is fraught with hazard. For that reason, it is perhaps best
to view PLS as merely a way of developing a good predictor, not as a way to
uncover causal relationships. I know this is often unsatisfying to
scientists trying to build parsimonious mechanistic models (= physical
theories), especially since the data are quite often unlikely to be
representative of any underlying population, and therefore unlikely to be
capable of predicting anything, but it is the statistical reality.
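
To make the point about correlated variables concrete, here is a minimal
sketch in Python (standard library only; it is ordinary least squares on
simulated data, not PLS and not the pls.pcr code). All names, the sample
size and the noise levels are assumptions chosen for illustration. Two
nearly collinear predictors each have a true coefficient of 1. Across
repeated samples the individual coefficients swing wildly, while their sum,
which is what drives the predictions, stays stable.

```python
import random
import statistics


def fit_two_predictor_ols(x1, x2, y):
    """Slopes of the least-squares fit y ~ x1 + x2 (with intercept),
    obtained by solving the centered 2x2 normal equations."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return b1, b2


def simulate_sample(rng, n=50):
    """Hypothetical data: x2 is x1 plus a little noise; y = x1 + x2 + noise."""
    x1 = [rng.gauss(0, 1) for _ in range(n)]
    x2 = [v + rng.gauss(0, 0.05) for v in x1]
    y = [a + b + rng.gauss(0, 1) for a, b in zip(x1, x2)]
    return x1, x2, y


rng = random.Random(1)  # fixed seed so the run is repeatable
b1_values, b2_values, sums = [], [], []
for _ in range(200):
    b1, b2 = fit_two_predictor_ols(*simulate_sample(rng))
    b1_values.append(b1)
    b2_values.append(b2)
    sums.append(b1 + b2)

sd_b1 = statistics.stdev(b1_values)
sd_b2 = statistics.stdev(b2_values)
sd_sum = statistics.stdev(sums)
# Fraction of samples in which x1 would be ranked as the "more important" one.
share_b1_larger = sum(a > b for a, b in zip(b1_values, b2_values)) / len(sums)

print("sd of b1:", round(sd_b1, 2), " sd of b2:", round(sd_b2, 2))
print("sd of b1 + b2:", round(sd_sum, 2))
print("share of samples ranking x1 above x2:", round(share_b1_larger, 2))
```

Under these assumptions, which of the two variables comes out "more
important" is close to a coin flip from sample to sample, even though the
fitted predictor barely changes.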

For a more informed, more interesting, and more eloquent discussion of these
and related issues, you might look up Leo Breiman's writings on his web site
and his way of trying to assess "variable importance" in his Random Forest
methodology, which is available in the package randomForest on CRAN. (I make
no claim about the effectiveness of this approach, only that it is a clearly
different way of approaching the issue, and one that reveals the dilemmas.)
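
For readers who have not seen it, Breiman's idea is to permute the values of
one variable in the out-of-bag cases and record how much the prediction
accuracy drops. The following is only a rough sketch of that idea in Python
(standard library only), not the randomForest implementation: it uses two
hand-written stand-in predictors instead of a forest, and scores them on the
whole simulated data set rather than on out-of-bag cases. The function
names, the models and the data are all hypothetical.

```python
import random


def mean_squared_error(predict, rows, targets):
    """Average squared prediction error of `predict` over the rows."""
    return sum((predict(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)


def permutation_importance(predict, rows, targets, column, rng, repeats=20):
    """Mean increase in squared error after shuffling one column."""
    baseline = mean_squared_error(predict, rows, targets)
    total_increase = 0.0
    for _ in range(repeats):
        shuffled = [r[column] for r in rows]
        rng.shuffle(shuffled)
        # Rebuild each row (a tuple) with the shuffled value in `column`.
        permuted = [r[:column] + (v,) + r[column + 1:]
                    for r, v in zip(rows, shuffled)]
        total_increase += mean_squared_error(predict, permuted, targets) - baseline
    return total_increase / repeats


rng = random.Random(2004)  # fixed seed so the run is repeatable
rows, targets = [], []
for _ in range(500):
    z = rng.gauss(0, 1)
    rows.append((z, z + rng.gauss(0, 0.05)))   # two nearly identical predictors
    targets.append(2 * z + rng.gauss(0, 0.5))  # response depends on z only


def model_a(row):
    """Stand-in predictor that splits the weight over both variables."""
    return row[0] + row[1]


def model_b(row):
    """Stand-in predictor that puts all the weight on the first variable."""
    return 2 * row[0]


error, importance = {}, {}
for name, model in (("A", model_a), ("B", model_b)):
    error[name] = mean_squared_error(model, rows, targets)
    importance[name] = [permutation_importance(model, rows, targets, col, rng)
                        for col in (0, 1)]
    print(name, "error:", round(error[name], 3),
          "importances:", [round(v, 2) for v in importance[name]])
```

Under these assumptions the two predictors have almost the same error, yet
one makes both variables look important and the other makes the second
variable look irrelevant. The importance belongs to the fitted predictor,
not to the variable.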


-- Bert Gunter
Genentech Non-Clinical Statistics
South San Francisco, CA
"The business of the statistician is to catalyze the scientific learning
process."  - George E. P. Box

> -----Original Message-----
> From: r-help-bounces at stat.math.ethz.ch 
> [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of 
> Christoph Lehmann
> Sent: Sunday, September 12, 2004 5:13 AM
> To: Ron Wehrens; r-help at stat.math.ethz.ch
> Subject: [R] Variable Importance in pls: R or B? (and in glpls?)
> Dear R-users, dear Ron
> I use pls from the pls.pcr package for classification. Since I need to
> know which variables are most influential on the classification
> performance, which criterion should I look at:
> a) B, the array of regression coefficients for a certain model (i.e., a
> certain number of latent variables) (and: squared or absolute values?)
> OR
> b) the weight matrix RR (or R in the De Jong publication; in Ding &
> Gentleman this is the P matrix and called 'loadings')? (and again:
> squared or absolute values?)
> And what about glpls (glpls1a)? Should I look at the 'coefficients'
> (regression coefficients)?
> Thanks for clarification
> Christoph
