[R] LASSO: glmpath and cv.glmpath

Fri Aug 21 17:16:18 CEST 2009

Hi,

On Aug 21, 2009, at 9:47 AM, Peter Schüffler wrote:

> Hi,
>
> perhaps you can help me to find out, how to find the best Lambda in  
> a LASSO-model.
>
> I have a feature selection problem with 150 proteins potentially  
> predicting Cancer or Noncancer. With a lasso model
>
> fit.glm <- glmpath(x=as.matrix(X), y=target, family="binomial")
>
> (target is 0, 1 <- Cancer non cancer, X the proteins, numerical in  
> expression), I get following path (PICTURE 1)
> One of these models is the best, according to its crossvalidation  
> (PICTURE 2), the red line corresponds to the best crossvalidation.  
> It's produced by
>
> cv <- cv.glmpath(x=as.matrix(X), y=unclass(T)-1, family="binomial",  
> type ="response", plot.it=TRUE, se=TRUE)
> abline(v= cv$fraction[max(which(cv$cv.error==min(cv$cv.error)))],  
> col="red", lty=2, lwd=3)
>
>
> Does anyone know, how to conclude from the Normfraction in PICTURE 2  
> to the corresponding model in PICTURE 1? What is the best model?  
> Which coefficients does it have? I can only see the best model's  
> cross validation error, but not the actual model. How to see it?

None of your pictures came through, so I'm not sure exactly what
you're trying to point out, but in general cross-validation will
help you find the best value of lambda for the lasso. That is the
value of lambda you'd then use for your downstream analysis.

I haven't used the glmpath package, but I have been using the glmnet
package, which is also by Hastie, is newer, and I believe covers the
same use cases as glmpath (though, to be honest, I'm not very
familiar w/ the Cox proportional hazards model). You might want
to look into it.

Anyway, speaking from my experience w/ the glmnet package, you might
try this:

1. Determine the best value of lambda using CV. I guess you can use  
MSE or R^2 as you see fit as your yardstick of "best."

2. Train a model over all of your data and ask it for the coefficients  
at the given value of lambda from 1.

3. See which proteins have non-zero coefficients.

<tongue-in-cheek>
4. Divine a biological story that is explained by your statistical  
findings

5. Publish.
</tongue-in-cheek>
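
Steps 1 to 3 above might look roughly like this with glmnet. This is
only a sketch: the data are simulated stand-ins for your protein
matrix (the names X, target and the "protein" column labels are made
up here), it assumes a glmnet version that ships cv.glmnet, and
binomial deviance is just one possible yardstick.

```r
library(glmnet)

set.seed(1)

# Simulated stand-in data (an assumption, not the real protein data):
# 100 samples, 150 "proteins", only the first two carry signal.
n <- 100
p <- 150
X <- matrix(rnorm(n * p), n, p,
            dimnames = list(NULL, paste0("protein", 1:p)))
target <- rbinom(n, 1, plogis(2 * X[, 1] - 2 * X[, 2]))

# 1. Pick lambda by cross-validation (10 folds by default).
cv.fit <- cv.glmnet(X, target, family = "binomial",
                    type.measure = "deviance")
best.lambda <- cv.fit$lambda.min

# 2. Fit the lasso path on all of the data and ask for the
#    coefficients at the lambda chosen in step 1.
fit <- glmnet(X, target, family = "binomial")
beta <- as.matrix(coef(fit, s = best.lambda))[, 1]

# 3. Keep the proteins whose coefficients are non-zero.
selected <- names(beta)[beta != 0 & names(beta) != "(Intercept)"]
print(selected)
```

cv.fit$lambda.1se is a more conservative alternative to lambda.min if
you'd rather have a sparser model.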

I guess there are many ways to do model selection, and I'm not sure
it's clear how effective they are (which isn't to say that you
shouldn't do them)[1] ... you might want to further divide your
data into training/tuning/test sets (somewhere between steps 1 and 2)
as another means of scoring models.
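
One way to sketch that kind of split with glmnet: hold out a test set
that never takes part in choosing lambda, let the CV folds inside
cv.glmnet act as the tuning sets, and score the chosen model on the
held-out samples. Again the data are simulated and the 30-sample test
size is an arbitrary choice for illustration.

```r
library(glmnet)

set.seed(2)

# Simulated stand-in data (an assumption, not the real protein data).
n <- 120
p <- 150
X <- matrix(rnorm(n * p), n, p)
target <- rbinom(n, 1, plogis(2 * X[, 1] - 2 * X[, 2]))

# Hold out 30 samples that are never used to choose lambda.
test.idx <- sample(n, 30)

# Train and tune on the remaining samples only.
cv.fit <- cv.glmnet(X[-test.idx, ], target[-test.idx],
                    family = "binomial")

# Score the tuned model on the held-out samples.
pred <- predict(cv.fit, newx = X[test.idx, ],
                s = "lambda.min", type = "class")
test.error <- mean(as.numeric(pred) != target[test.idx])
print(test.error)
```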

HTH,
-steve

[1] http://hunch.net/?p=29

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
   |  Memorial Sloan-Kettering Cancer Center
   |  Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact