[R] Interperting results of glmnet and coxph plot, Brier score and Harrel's C-Index - am I doing something wrong ???
Frank Harrell
f.harrell at vanderbilt.edu
Sat Sep 28 16:48:38 CEST 2013
This entire procedure is not valid. You cannot use a penalized method for
selecting variables then use an unpenalized procedure on those selected.
Frank
David Winsemius wrote
> On Sep 28, 2013, at 2:39 AM, E Joffe wrote:
>
>> Hi all,
>>
>> I am using COX LASSO (glmnet / coxnet) regression to analyze a
>> dataset of
>> 394 obs. / 268 vars.
>> I use the following procedure:
>> 1. Construct a coxnet on the entire dataset (by cv.glmnet)
>> 2. Pick the significant features by selecting the non-zero coefficient
>> under the best lambda selected by the model
>> 3. Build a coxph model with bi-directional stepwise feature selection
>> limited to the coxnet selected features.
>
> I was a bit puzzled by the third step. Once you had a reduced model
> from glmnet, what was the statistical basis for further elimination of
> variables?
>
> (Quite apart from the statistical issues, I was rather surprised that
> this procedure even produced results since the 'step' function is not
> described in the 'stats' package as applying to 'coxph' model objects.)
>
>>
>> To validate the model I use both Brier score (library=peperr) and
>> Harrel's [Harrell]
>> C-Index (library=Hmisc) with a bootstrap of 50 iterations.
>>
>>
>> MY QUESTION : I am getting fairly good C-Index (0.76) and Brier
>> (0.07)
>> values for the models however per the coxnet the %Dev explained by
>> the model
>> is at best 0.27 and when I plot the survfit of the coxph the plotted
>> confidence interval is very large.
>
> How many events did you have? (The width of CI's is most importantly
> dependent on event counts and not particularly improved by a high case
> count. The power considerations are very similar to those of a
> binomial test.)
>
>
>> What am I missing here ?
>
> Perhaps sufficient events? (You also seem to be missing a description
> of the study goals.)
>
>
> --
> David.
>
>>
>> %DEV=27%
>>
>>
>>
>> Brier score - 0.07 ($ipec.coxglmnet -> [1] 7.24)
>> C-Index - 0.76 ($cIndex -> [1] 0.763)
>>
>>
>>
>> DATA: [Private Health Information - can't publish] 394 obs./268 vars.
>>
>> CODE (need to define a dataset with 'time' and 'status' variables):
>>
>> library("survival")
>> library ("glmnet")
>> library ("c060")
>> library ("peperr")
>> library ("Hmisc")
>>
>> #creat Y (survival matrix) for glmnet
>> surv_obj <- Surv(dataset$time,dataset$status)
>>
>>
>> ## tranform categorical variables into binary variables with
>> dummy for
>> dataset
>> predict_matrix <- model.matrix(~ ., data=dataset,
>> contrasts.arg = lapply
>> (dataset[,sapply(dataset, is.factor)], contrasts))
>>
>> ## remove the statu/time variables from the predictor matrix (x)
>> for
>> glmnet
>> predict_matrix <- subset (predict_matrix, select=c(-time,-status))
>>
>> ## create a glmnet cox object using lasso regularization and cross
>> validation
>> glmnet.cv <- cv.glmnet (predict_matrix, surv_obj, family="cox")
>>
>> ## get the glmnet model on the full dataset
>> glmnet.obj <- glmnet.cv$glmnet.fit
>>
>> # find lambda index for the models with least partial likelihood
>> deviance (by cv.glmnet)
>> optimal.lambda <- glmnet.cv$lambda.min # For a more parsimoneous
>> model use lambda.1se
>> lambda.index <- which(glmnet.obj$lambda==optimal.lambda)
>>
>>
>> # take beta for optimal lambda
>> optimal.beta <- glmnet.obj$beta[,lambda.index]
>>
>> # find non zero beta coef
>> nonzero.coef <- abs(optimal.beta)>0
>> selectedBeta <- optimal.beta[nonzero.coef]
>>
>> # take only covariates for which beta is not zero
>> selectedVar <- predict_matrix[,nonzero.coef]
>>
>> # create a dataframe for trainSet with time, status and selected
>> variables in binary representation for evaluation in pec
>> reformat_dataSet <- as.data.frame(cbind(surv_obj,selectedVar))
>>
>> # glmnet.cox only with meaningful features selected by stepwise
>> bidirectional AIC feature selection
>> glmnet.cox.meaningful <- step(coxph(Surv(time,status) ~
>> .,data=reformat_dataSet),direction="both")
>>
>>
>>
>>
>> ##--------------------------------------------------------------------------
>> -----------------------------
>> ## MODEL PERFORMANCE
>>
>> ##--------------------------------------------------------------------------
>> -----------------------------
>> ##
>>
>>
>> ## Calculate the Brier score - pec does its own bootstrap so this
>> function runs on i=51 (i.e., whole trainset)
>>
>> ## Brier score calculation to cox-glmnet
>> peperr.coxglmnet <- peperr(response=surv_obj, x=selectedVarCox,
>> fit.fun=fit.coxph, load.all=TRUE,
>>
>> indices=resample.indices(n=nrow(surv_obj),
>> method="boot", sample.n=50))
>>
>> # Get error predictions from peperr
>> prederr.coxglmnet <- perr(peperr.coxglmnet)
>>
>> # Integrated prediction error Brier score calculation
>> ipec.coxglmnet<-ipec(prederr.coxglmnet,
>> eval.times=peperr.coxglmnet$attribute, response=surv_obj)
>>
>>
>> ## C-Index calculation 50 iter bootstrapping
>>
>> for (i in 1:50){
>> print (paste("Iteration:",i))
>> train <- sample(1:nrow(dataset), nrow(dataset), replace =
>> TRUE) ##
>> random sampling with replacement
>> # create a dataframe for trainSet with time, status and
>> selected
>> variables in binary representation for evaluation in pec
>> reformat_trainSet <- reformat_dataSet [train,]
>>
>>
>> # glmnet.cox only with meaningful features selected by stepwise
>> bidirectional AIC feature selection
>> glmnet.cox.meaningful.test <- step(coxph(Surv(time,status) ~
>> .,data=reformat_dataSet),direction="both")
>>
>> selectedVarCox <-
>> predict_matrix[,attr(glmnet.cox.meaningful.test$terms,"term.labels")]
>> reformat_testSet <-
>> as.data.frame(cbind(surv_obj,selectedVarCox))
>> reformat_testSet <- reformat_dataSet [-train,]
>>
>>
>> # compute c-index (Harrell's) for cox-glmnet models
>> if (is.null(glmnet.cox.meaningful)){
>> cIndexCoxglmnet <- c(cIndexCoxglmnet,NULL)
>> }else{
>> cIndexCoxglmnet <- c(cIndexCoxglmnet,
>> 1-rcorr.cens(predict(glmnet.cox.meaningful,
>> reformat_testSet),Surv(reformat_testSet$time,reformat_testSet
>> $status))[1])
>> }
>> }
>>
>> #Get average C-Index
>> cIndex<- mean (unlist(cIndexCoxglmnet),rm.na=TRUE)
>>
>> #create a list of all the objects generated
>>
>> assign
>> (name,c(eval(parse(text=name)),glmnet.cv=list(glmnet.cv),glmnet.obj=li
>> st(glmnet.obj),
>>
>> selectedVar=list(colnames(selectedVar)),glmnet.cox=list(glmnet.cox),
>>
>> glmnet
>> .cox.meaningful=list(glmnet.cox.meaningful),ipec.coxglmnet=list(ipec.c
>> oxglmnet),
>> cIndex=cIndex))
>>
>> # save image of the workspace after each iteration
>> save.image("final_subgroup_analysissubgroup_analysis.RData")
>>
>>
>> ______________________________________________
>>
> R-help@
> mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>
> David Winsemius, MD
> Alameda, CA, USA
>
> ______________________________________________
> R-help@
> mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
-----
Frank Harrell
Department of Biostatistics, Vanderbilt University
--
View this message in context: http://r.789695.n4.nabble.com/Interperting-results-of-glmnet-and-coxph-plot-Brier-score-and-Harrel-s-C-Index-am-I-doing-something--tp4677166p4677175.html
Sent from the R help mailing list archive at Nabble.com.
More information about the R-help
mailing list