[R] coxpath() in package glmpath

Sun Mar 2 01:32:02 CET 2008

Hi,

I am new to model selection by coefficient shrinkage
methods such as the lasso, and I have become
particularly interested in variable selection in Cox
regression by the lasso. I learned that coxpath() in
the R package glmpath fits the lasso for the Cox model.
I have tried the sample script on the help page of
coxpath(), but I have a difficult time understanding
the output. I would therefore greatly appreciate it if
anyone could help me understand how to use the function.

> data(lung.data)
> attach(lung.data)
> fit.a <- coxpath(lung.data)
> print(fit.a)
Call:
coxpath(data = lung.data)
Step 1 :  karno
Step 2 :  celltype
Step 5 :  trt
Step 6 :  prior
Step 7 :  age
Step 8 :  diagtime

> summary(fit.a)
Call:
coxpath(data = lung.data)
       Df Log.p.lik       AIC       BIC
Step 1  0 -505.8840 1011.7679 1011.7679
Step 2  1 -486.0691  974.1382  977.0581
Step 5  2 -484.8520  973.7040  979.5440
Step 6  3 -483.4018  972.8036  981.5636
Step 7  4 -483.3801  974.7602  986.4401
Step 8  5 -483.2287  976.4573  991.0572
Step 9  6 -483.1112  978.2224  995.7423

First of all, why do the two outputs above list a
different number of steps? print(fit.a) shows six
steps (1, 2, 5, 6, 7, 8), while summary(fit.a) shows
seven (the same six plus Step 9). I confirmed with
coxph() that the numbers (Log.p.lik, AIC, BIC) in the
first row of summary(fit.a) are those of a null Cox
model, i.e. a model with no covariates. How, then,
does "Step 1" in the output of summary(fit.a)
correspond to "Step 1" in the output of print(fit.a),
where it seems to mean a model containing the variable
"karno"?
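
To make the summary table easier to discuss, here is a
minimal sketch in base R that recomputes the AIC and BIC
columns from Log.p.lik and Df. It assumes the usual
definitions, AIC = -2*loglik + 2*Df and
BIC = -2*loglik + log(n)*Df, and it assumes n = 137
observations. Neither the formulas nor n is printed in the
output above, so both are assumptions on my part; the
values agree with the table up to rounding.

```r
# Values copied from the summary(fit.a) table above.
tab <- data.frame(
  step      = c(1, 2, 5, 6, 7, 8, 9),
  df        = 0:6,
  log.p.lik = c(-505.8840, -486.0691, -484.8520, -483.4018,
                -483.3801, -483.2287, -483.1112)
)

# ASSUMPTION: sample size used in the BIC penalty; not shown in the output.
n <- 137

# ASSUMPTION: standard AIC/BIC definitions with Df as the penalty count.
tab$aic <- -2 * tab$log.p.lik + 2 * tab$df
tab$bic <- -2 * tab$log.p.lik + log(n) * tab$df

print(round(tab, 4))
```

If these definitions are the right ones, the Df = 0 row can
only be a model with no active variables, which is what
makes the "Step 1 : karno" label in print(fit.a) puzzling.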

> predict(fit.a)
     trt celltype   karno  diagtime        age      prior
1 0.0000   0.0000  0.0000 0.000e+00  0.000e+00  0.000e+00
2 0.0000   0.0076 -0.0256 0.000e+00  0.000e+00  0.000e+00
5 0.0000   0.0450 -0.0286 0.000e+00  0.000e+00  0.000e+00
6 0.1428   0.1033 -0.0330 0.000e+00  0.000e+00 -4.326e-05
7 0.1468   0.1048 -0.0332 0.000e+00 -1.043e-07 -3.506e-04
8 0.1755   0.1139 -0.0340 5.642e-06 -1.404e-03 -2.367e-03
attr(,"s")
[1] 1 2 5 6 7 8
attr(,"fraction")
[1] 0.000 0.125 0.500 0.625 0.750 0.875
attr(,"mode")
[1] "step"

Second, when I compare the output of print(fit.a) with
that of predict(fit.a), I see some discrepancies. For
example, "Step 1" of print(fit.a) lists the variable
"karno", yet predict(fit.a) shows that the coefficient
of "karno" is still 0 at step 1. The same is true of
the variable "trt" at "Step 5". What do these
discrepancies mean? I have probably misunderstood how
coefficient shrinkage works in the first place, so I
would appreciate it if anyone could shed some light on
this.

I would also like to hear any opinions on how I should
do variable selection from this output. Should I rely
on the table (Log.p.lik, AIC, BIC) from summary(fit.a),
or should I rely on the table of coefficients from
predict(fit.a) and eliminate the variables whose
coefficients are 0 at a chosen step?
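
To make the question concrete, here is a minimal sketch in
base R of one possible reading: pick the step with the
smallest AIC (or BIC) in the summary table, then keep the
variables whose coefficients are nonzero in the
predict(fit.a) row with the same label. It assumes that a
step label refers to the same point on the path in both
outputs, which is exactly what I am unsure about, so please
treat it as an illustration rather than a recommended
procedure.

```r
# AIC/BIC copied from the summary(fit.a) table above.
tab <- data.frame(
  step = c(1, 2, 5, 6, 7, 8, 9),
  aic  = c(1011.7679, 974.1382, 973.7040, 972.8036,
           974.7602, 976.4573, 978.2224),
  bic  = c(1011.7679, 977.0581, 979.5440, 981.5636,
           986.4401, 991.0572, 995.7423)
)

# Coefficients copied from the predict(fit.a) output above (no row for step 9).
coefs <- matrix(
  c(0,      0,      0,       0,         0,          0,
    0,      0.0076, -0.0256, 0,         0,          0,
    0,      0.0450, -0.0286, 0,         0,          0,
    0.1428, 0.1033, -0.0330, 0,         0,          -4.326e-05,
    0.1468, 0.1048, -0.0332, 0,         -1.043e-07, -3.506e-04,
    0.1755, 0.1139, -0.0340, 5.642e-06, -1.404e-03, -2.367e-03),
  ncol = 6, byrow = TRUE,
  dimnames = list(c("1", "2", "5", "6", "7", "8"),
                  c("trt", "celltype", "karno", "diagtime", "age", "prior"))
)

# Step labels that minimise each criterion.
best.aic <- tab$step[which.min(tab$aic)]
best.bic <- tab$step[which.min(tab$bic)]

# ASSUMPTION: a step label means the same point on the path in both tables.
sel.aic <- names(which(coefs[as.character(best.aic), ] != 0))
sel.bic <- names(which(coefs[as.character(best.bic), ] != 0))

print(list(best.aic = best.aic, vars.aic = sel.aic,
           best.bic = best.bic, vars.bic = sel.bic))
```

The two criteria do not pick the same step here, so the
choice between them matters as well.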

Thank you very much for your time.