[R] FW: logistic regression
Frank E Harrell Jr
f.harrell at vanderbilt.edu
Sat Sep 27 22:41:00 CEST 2008
Darin Brooks wrote:
> Sorry.
>
> Let me try again then.
>
> I am trying to find "significant" predictors from a list of about 44
> independent variables. So I started with all 44 variables and ran
Why? What is wrong with insignificant predictors?
> drop1(sep22lr, test="Chisq")... and then dropped the highest p value from
> the run. Then I reran the drop1.
>
> Model:
> MIN_Mstocked ~ ORG_CODE + BECLBL08 + PEM_SScat + SOIL_MST_1 +
> SOIL_NUTR + cE + cN + cELEV + cDIAM_125 + cCRCLS + cCULM_125 +
> cSPH + cAGE + cVRI_NONPINE + cVRI_nonpineCFR + cVRI_BLEAF +
> cvol_125 + cstrDST_SW + cwaterDST_SW + cSEEDSRCE_SW + cMAT +
> cMWMT + cMCMT + cTD + cMAP + cMSP + cAHM + cSHM + cMATMAP +
> cddless0 + cddless18 + cddgrtr0 + cddgrtr18 + cNFFD + cbFFP +
> ceFFP + cPAS + cDD5_100 + cEXT_Cold + cS_INDX
>                 Df Deviance    AIC   LRT   Pr(Chi)
> <none>               814.21 938.21
> ORG_CODE         4   824.97 940.97 10.76 0.0294100 *
> BECLBL08         9   845.61 951.61 31.41 0.0002519 ***
> PEM_SScat       10   829.11 933.11 14.90 0.1357580
> SOIL_MST_1       1   814.63 936.63  0.43 0.5135094
> SOIL_NUTR        2   818.49 938.49  4.28 0.1175411
> cE               1   814.37 936.37  0.16 0.6886085
> cN               1   814.40 936.40  0.20 0.6566765
> cELEV            1   814.35 936.35  0.14 0.7044864
> cDIAM_125        1   817.98 939.98  3.78 0.0519554 .
> cCRCLS           1   819.32 941.32  5.11 0.0237598 *
> cCULM_125        1   816.17 938.17  1.97 0.1606846
> cSPH             1   816.62 938.62  2.41 0.1204141
> cAGE             1   815.92 937.92  1.72 0.1902314
> cVRI_NONPINE     1   818.04 940.04  3.84 0.0501149 .
> cVRI_nonpineCFR  1   821.17 943.17  6.96 0.0083197 **
> cVRI_BLEAF       1   818.78 940.78  4.58 0.0324286 *
> cvol_125         1   814.67 936.67  0.47 0.4949495
> cstrDST_SW       1   814.63 936.63  0.42 0.5169757
> cwaterDST_SW     1   814.75 936.75  0.55 0.4592643
> cSEEDSRCE_SW     1   817.73 939.73  3.53 0.0604234 .
> cMAT             1   814.27 936.27  0.06 0.8002333
> cMWMT            1   814.49 936.49  0.28 0.5942246
> cMCMT            1   819.39 941.39  5.18 0.0228425 *
> cTD              1   816.20 938.20  1.99 0.1580332
> cMAP             1   814.25 936.25  0.04 0.8386626
> cMSP             1   818.41 940.41  4.20 0.0404411 *
> cAHM             1   815.66 937.66  1.46 0.2276311
> cSHM             1   819.95 941.95  5.75 0.0165227 *
> cMATMAP          1   814.91 936.91  0.71 0.4001878
> cddless0         1   818.04 940.04  3.83 0.0502153 .
> cddless18        1   817.81 939.81  3.60 0.0576931 .
> cddgrtr0         1   816.64 938.64  2.44 0.1184235
> cddgrtr18        1   815.77 937.77  1.57 0.2104958
> cNFFD            1   815.38 937.38  1.18 0.2782582
> cbFFP            1   814.39 936.39  0.18 0.6677481
> ceFFP            1   820.22 942.22  6.01 0.0141863 *
> cPAS             1   814.21 936.21  0.01 0.9347654
> cDD5_100         1   814.79 936.79  0.58 0.4447531
> cEXT_Cold        1   816.99 938.99  2.78 0.0954512 .
> cS_INDX          1   815.21 937.21  1.01 0.3157208
>
>
> And then systematically reran the drop1, removing the HIGHEST p value (least
> significant) from each result until only significant (0.10) variables
> remained.
>
> Model:
> MIN_Mstocked ~ ORG_CODE + BECLBL08 + PEM_SScat + SOIL_NUTR +
> cSEEDSRCE_SW + cMSP + ceFFP + cEXT_Cold
>                 Df Deviance    AIC   LRT   Pr(Chi)
> <none>               884.20 946.20
> ORG_CODE         4   916.38 970.38 32.18 1.757e-06 ***
> BECLBL08         9   940.66 984.66 56.46 6.418e-09 ***
> PEM_SScat       11   906.20 946.20 22.00 0.0243795 *
> SOIL_NUTR        2   894.19 952.19  9.99 0.0067557 **
> cSEEDSRCE_SW     1   894.41 954.41 10.21 0.0013983 **
> cMSP             1   896.97 956.97 12.77 0.0003516 ***
> ceFFP            1   928.50 988.50 44.30 2.812e-11 ***
> cEXT_Cold        1   923.35 983.35 39.15 3.921e-10 ***
>
>
> I didn't create any kind of dummy or factor variables for my categorical
> data (at least, not on purpose).
>
> With a remaining 8 variables, I tried to run a logistic regression (glm)
> against my dependent variable(MIN_Mstocked). When I do a summary of the
Estimates from this model (and especially standard errors and P-values)
will be invalid because they do not take into account the stepwise
procedure above that was used to torture the data until they confessed.
Frank
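
To make the point above concrete, here is a minimal sketch, not an analysis of the data in this thread. It assumes simulated data in which the outcome is a coin flip and every predictor is pure noise, then applies the same drop1 / highest-p-value elimination with a 0.10 cutoff. All names (dat, V1..V20, kept) are hypothetical.

```r
# Sketch: backward elimination on pure noise (simulated, hypothetical names).
set.seed(1)
n <- 300   # observations
p <- 20    # noise predictors, none related to the outcome
dat <- as.data.frame(matrix(rnorm(n * p), n, p))  # columns V1..V20
dat$y <- rbinom(n, 1, 0.5)                         # outcome is a coin flip

form <- reformulate(paste0("V", 1:p), response = "y")
fit  <- glm(form, data = dat, family = binomial)

repeat {
  # Stop if every term has already been dropped
  if (length(attr(terms(fit), "term.labels")) == 0) break
  d  <- drop1(fit, test = "Chisq")
  pv <- d[-1, ncol(d)]              # p-values; the first row is <none>
  if (max(pv) <= 0.10) break        # everything left looks "significant"
  worst <- rownames(d)[-1][which.max(pv)]
  fit <- update(fit, as.formula(paste(". ~ . -", worst)))
}

kept <- attr(terms(fit), "term.labels")
length(kept)                 # number of noise variables that survived
summary(fit)$coefficients    # their standard errors and P-values ignore the selection
```

Any variable left in kept is there by chance alone, yet the final summary reports it with a small P-value, because nothing in that table knows how many candidates were discarded to get there.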
> glm, I am provided with the usual table of estimate, std error, z value, and
> Pr(>|z|)... BUT there are some coefficients missing in the list. None of
> the categorical data is complete. Some are missing only one category, while
> others are missing 4 or 5 categories.
>
> e.g.
>
> Coefficients:
>                        Estimate Std. Error z value Pr(>|z|)
> (Intercept)          -1.324e+02  1.363e+03  -0.097 0.922611
> ORG_CODE[T.DLA]      -1.504e+01  1.363e+03  -0.011 0.991192
> ORG_CODE[T.DMO]      -1.494e+01  1.363e+03  -0.011 0.991253
> ORG_CODE[T.DPG]      -1.766e+01  1.363e+03  -0.013 0.989658
> ORG_CODE[T.DVA]      -1.841e+01  1.363e+03  -0.014 0.989220
> BECLBL08[T.SBS dw 2] -6.733e-01  5.903e-01  -1.141 0.254033
> BECLBL08[T.SBS dw 3] -1.094e+00  5.714e-01  -1.914 0.055586 .
> BECLBL08[T.SBS mc 2]  1.573e-01  5.004e-01   0.314 0.753211
> BECLBL08[T.SBS mc 3]  1.402e+00  5.824e-01   2.408 0.016043 *
> BECLBL08[T.SBS mk 1] -2.388e+00  7.529e-01  -3.172 0.001514 **
> BECLBL08[T.SBS mw]   -1.672e+01  1.393e+03  -0.012 0.990425
> BECLBL08[T.SBS vk]   -1.614e+01  1.243e+03  -0.013 0.989640
> BECLBL08[T.SBS wk 1] -3.640e+00  8.174e-01  -4.453 8.48e-06 ***
> BECLBL08[T.SBS wk 3] -1.838e+01  1.363e+03  -0.013 0.989240
> PEM_SScat[T.B]       -1.815e+01  3.956e+03  -0.005 0.996339
> PEM_SScat[T.C]        1.998e-01  3.925e-01   0.509 0.610792
> PEM_SScat[T.D]       -2.314e-01  3.215e-01  -0.720 0.471621
> PEM_SScat[T.E]        5.581e-01  3.433e-01   1.626 0.104020
> PEM_SScat[T.F]       -1.113e+00  5.782e-01  -1.926 0.054153 .
> PEM_SScat[T.G]        1.780e-01  4.420e-01   0.403 0.687150
> PEM_SScat[T.H]        1.670e+01  3.956e+03   0.004 0.996633
> PEM_SScat[T.I]        2.751e-01  9.313e-01   0.295 0.767705
> PEM_SScat[T.J]       -2.623e-01  9.693e-01  -0.271 0.786649
> PEM_SScat[T.K]       -1.862e+01  3.956e+03  -0.005 0.996244
> PEM_SScat[T.L]       -1.661e+01  1.211e+03  -0.014 0.989056
> SOIL_NUTR[T.C]       -1.119e+00  3.781e-01  -2.960 0.003073 **
> SOIL_NUTR[T.D]       -7.912e-02  9.049e-01  -0.087 0.930320
> cSEEDSRCE_SW         -1.512e-03  4.930e-04  -3.066 0.002170 **
> cMSP                  1.808e-02  5.304e-03   3.409 0.000652 ***
> ceFFP                 2.889e-01  4.662e-02   6.196 5.80e-10 ***
> cEXT_Cold            -1.880e+00  3.330e-01  -5.647 1.63e-08 ***
>
> There should be a PEM_SScat[T.A]. It is the most prevalent occurrence in
> this category.
>
> ORG_CODE is missing more than 6 categories in the list
>
> SOIL_NUTR should have a [T.B]
>
> Does that help?
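
On the "missing" categories: with treatment contrasts (R's default for unordered factors), the first level of each factor is the reference level. It gets no coefficient of its own because it is absorbed into the intercept, and the other coefficients are differences from it. That would account for the absence of a [T.A] row for PEM_SScat and a [T.B] row for SOIL_NUTR. A minimal sketch with an invented toy factor (soil, soil2, X and X2 are hypothetical names, and the values are made up; it assumes the default contrasts option is in effect):

```r
# Toy factor with levels B, C, D (invented values, for illustration only)
soil <- factor(c("B", "C", "D", "B", "C", "B"))
levels(soil)                 # "B" comes first, so it is the reference level

# Design matrix under the default treatment contrasts:
# there is an intercept plus columns for C and D, but no column for B
X <- model.matrix(~ soil)
colnames(X)

# To use a different baseline, relevel the factor before fitting
soil2 <- relevel(soil, ref = "C")
X2 <- model.matrix(~ soil2)
colnames(X2)                 # now C is absorbed into the intercept instead
```

Running levels() on each factor in the real data frame would show which category is serving as the baseline in the glm summary.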
>
> -----Original Message-----
> From: Kevin E. Thorpe [mailto:kevin.thorpe at utoronto.ca]
> Sent: Saturday, September 27, 2008 6:21 AM
> To: Darin Brooks
> Cc: r-help at r-project.org
> Subject: Re: [R] logistic regression
>
>
> Darin Brooks wrote:
>> Good afternoon
>>
>> I have what I hope is a simple logistic regression issue.
>>
>> I started with 44 independent variables and then used the drop1,
>> test="chisq" to reduce the list to 8 significant independent variables.
>>
>> drop1(sep22lr, test="Chisq") and wound up with this model:
>>
>> Model: MIN_Mstocked ~ ORG_CODE + BECLBL08 + PEM_SScat + SOIL_NUTR +
>> cSEEDSRCE_SW + cMSP + ceFFP + cEXT_Cold
>>
>> 4 of the remaining variables are categorical and 4 are continuous.
>>
>> However, when I run a glm and then a summary on the glm - some of the
>> categorical data is missing from the output.
>>
>> The PEM_SScat is missing only one variable ... the BECLBL08 is missing
>> several variables ... the ORG_CODE is missing 4 .. and the SOIL_NUTR
>> is missing 1 variable.
>>
>> The number of missing variables seems arbitrary. Is there
>> something wrong with my syntax in calling the logistic model? Am I not
>> understanding the inputs correctly?
>>
>> Any help would be appreciated.
>>
>
> I'm not sure I fully understand your question. It sounds like you created
> your own dummy variables for your categorical variables. Did you? Or did
> you use factor variables for your categorical variables?
> If the latter, then I REALLY don't understand your question.
>
> Kevin
>
> --
> Kevin E. Thorpe
> Biostatistician/Trialist, Knowledge Translation Program
> Assistant Professor, Dalla Lana School of Public Health
> University of Toronto
> email: kevin.thorpe at utoronto.ca Tel: 416.864.5776 Fax: 416.864.6057
>
--
Frank E Harrell Jr   Professor and Chair           School of Medicine
                     Department of Biostatistics   Vanderbilt University