[R] LASSO: glmpath and cv.glmpath
Steve Lianoglou
mailinglist.honeypot at gmail.com
Fri Aug 21 17:16:18 CEST 2009
Hi,
On Aug 21, 2009, at 9:47 AM, Peter Schüffler wrote:
> Hi,
>
> Perhaps you can help me work out how to find the best lambda in
> a LASSO model.
>
> I have a feature selection problem with 150 proteins potentially
> predicting Cancer or Noncancer. With a lasso model
>
> fit.glm <- glmpath(x=as.matrix(X), y=target, family="binomial")
>
> (target is 0/1 for cancer/non-cancer, X holds the proteins as numeric
> expression values), I get the following path (PICTURE 1).
> One of these models is the best according to its cross-validation
> error (PICTURE 2); the red line marks the best cross-validation error.
> It is produced by
>
> cv <- cv.glmpath(x=as.matrix(X), y=unclass(T)-1, family="binomial",
> type ="response", plot.it=TRUE, se=TRUE)
> abline(v= cv$fraction[max(which(cv$cv.error==min(cv$cv.error)))],
> col="red", lty=2, lwd=3)
>
>
> Does anyone know how to get from the norm fraction in PICTURE 2
> to the corresponding model in PICTURE 1? What is the best model?
> Which coefficients does it have? I can only see the best model's
> cross-validation error, but not the actual model. How can I see it?
None of your pictures came through, so I'm not sure exactly what
you're trying to point out, but in general the cross validation will
help you find the best value for lambda for the lasso. I think it's
the value of lambda that you'll use for your downstream analysis.
I haven't used the glmpath package, but I have been using the glmnet
package, which is also by Hastie, is newer, and I believe covers the
same use cases as glmpath (though, to be honest, I'm not very
familiar w/ the Cox proportional hazards model). You might want
to look into it.
Anyway, speaking from my experience w/ the glmnet package, you might
try this:
1. Determine the best value of lambda using CV. I guess you can use
MSE or R^2 as you see fit as your yardstick of "best."
2. Train a model over all of your data and ask it for the coefficients
at the given value of lambda from 1.
3. See which proteins have non-zero coefficients.
<tongue-in-cheek>
4. Divine a biological story that is explained by your statistical
findings
5. Publish.
</tongue-in-cheek>
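For what it's worth, steps 1-3 might look roughly like the following
with glmnet. This is only a sketch under some assumptions: X and target
are simulated stand-ins for your data, the protein names are made up,
and I'm using binomial deviance as the CV yardstick and lambda.min as
the "best" lambda, which are choices you may well want to change.

```r
library(glmnet)

# Simulated stand-in for the real data (hypothetical):
# 100 samples, 150 "proteins", binary outcome driven by the first two.
set.seed(1)
n <- 100
p <- 150
X <- matrix(rnorm(n * p), n, p,
            dimnames = list(NULL, paste0("protein", seq_len(p))))
target <- rbinom(n, 1, plogis(2 * X[, 1] - 2 * X[, 2]))

# 1. Choose lambda by 10-fold cross-validation.
cv.fit <- cv.glmnet(X, target, family = "binomial",
                    type.measure = "deviance", nfolds = 10)
best.lambda <- cv.fit$lambda.min

# 2. Fit the lasso path on all of the data and ask for the
#    coefficients at the lambda chosen in step 1.
fit <- glmnet(X, target, family = "binomial")
beta <- as.matrix(coef(fit, s = best.lambda))

# 3. List the proteins with non-zero coefficients.
selected <- setdiff(rownames(beta)[beta[, 1] != 0], "(Intercept)")
print(selected)
```

With your own data you would drop the simulation and pass
as.matrix(X) and your 0/1 target directly.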
I guess there are many ways to do model selection, and I'm not sure
it's clear how effective they are (which isn't to say that you
shouldn't do them)[1] ... you might want to further divide your
data into training/tuning/test sets (somewhere between steps 1 and 2)
as another means of scoring models.
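One way to read the training/tuning/test idea, again just as a sketch
with simulated data: hold out a test set, let cross-validation on the
remaining samples do the tuning of lambda, and score the chosen model
once on the held-out samples. The 25% split and the misclassification
rate are my assumptions, not anything specific to your problem.

```r
library(glmnet)

# Simulated stand-in data (hypothetical), as in a typical 0/1 problem.
set.seed(2)
n <- 120
p <- 150
X <- matrix(rnorm(n * p), n, p)
target <- rbinom(n, 1, plogis(2 * X[, 1] - 2 * X[, 2]))

# Hold out a quarter of the samples as a test set.
test.idx <- sample(n, size = n %/% 4)

# Tune lambda by cross-validation on the remaining samples only.
cv.fit <- cv.glmnet(X[-test.idx, ], target[-test.idx], family = "binomial")

# Score the tuned model once on the untouched test set.
pred.class <- as.integer(predict(cv.fit, newx = X[test.idx, ],
                                 s = "lambda.min", type = "class"))
test.error <- mean(pred.class != target[test.idx])
print(test.error)
```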
HTH,
-steve
[1] http://hunch.net/?p=29
--
Steve Lianoglou
Graduate Student: Computational Systems Biology
| Memorial Sloan-Kettering Cancer Center
| Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact