[R] P-value and R-squared variable selection criteria

Daniel Malter daniel at umd.edu
Thu Sep 24 15:59:50 CEST 2009


Don't throw out the baby with the bath water just yet. Note that even though
your first model is insignificant, the R-squared is very high. This is
because you fit the whole model with intercept and three coefficients on 1
degree of freedom (5 observations minus 4 estimated parameters). You need to
first import all of the data, then run the model, and only then decide which
coefficients to include.
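
To illustrate the point, here is a minimal sketch with made-up data (the seed
and all object names are arbitrary, not from your data): pure noise is
regressed on three unrelated predictors using only 5 observations, many times
over, and the R-squared values are collected.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

# one example fit: 5 observations, intercept + 3 slopes
sim <- data.frame(y = rnorm(5), a = rnorm(5), b = rnorm(5), d = rnorm(5))
fit_noise <- lm(y ~ a + b + d, data = sim)
df.residual(fit_noise)   # residual degrees of freedom: 5 - 4

# repeat the exercise 2000 times and keep the R-squared of each fit
r2_noise <- replicate(2000, {
  s <- data.frame(y = rnorm(5), a = rnorm(5), b = rnorm(5), d = rnorm(5))
  summary(lm(y ~ a + b + d, data = s))$r.squared
})
mean(r2_noise)           # average R-squared when y is unrelated to a, b, d
```

Under these assumptions the expected R-squared is 3/4 even though no predictor
has any relation to y, so a high R-squared on 1 degree of freedom says very
little.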

Second, you may have data redundancy issues, for example, if altitude
correlates with longitude or latitude (especially since you have so few
stations from a very restricted region, this seems more likely than for
larger regions). Check the correlations. If they are high, you may think
about data reduction strategies (e.g. principal components analysis).
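
A minimal sketch of that check, using the five example rows from the question
below (the object names "predictors" and "pca" are my own; with only five rows
the numbers are purely illustrative):

```r
# the five example rows from the question
clima <- data.frame(
  V1 = 1:5,                          # climate station
  V2 = c(18, 21, 23, 19, 17),        # temperature
  V3 = c(3, 6, 9, 8, 3),             # latitude
  V4 = c(6, 8, 5, 2, 2),             # longitude
  V5 = c(187, 68, 42, 194, 225)      # altitude
)

predictors <- clima[, c("V3", "V4", "V5")]

# pairwise correlations among the predictors
round(cor(predictors), 2)

# principal components on standardized predictors
pca <- prcomp(predictors, scale. = TRUE)
summary(pca)   # proportion of variance captured by each component
```

If one or two components carry nearly all the variance, the three predictors
are largely redundant.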

Further, your data is panel data (where the cross-section is the 10 stations
and the time series is the 2004 to 2008 monthly data). Thus, it is very
likely that fitting OLS without recognizing the dependence of the
time series within each station is problematic. On top of that, there is
certainly correlation across stations, e.g., due to seasonal patterns, that
you may want to account for.
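
One possible way to respect this structure, sketched on simulated data, is a
mixed model with a random intercept per station and a month factor for the
seasonal pattern. Everything here is an assumption for illustration: the
simulated altitudes, the effect sizes, and the choice of nlme::lme (other
approaches, e.g. adding an AR(1) correlation structure, may suit your data
better).

```r
library(nlme)   # recommended package shipped with standard R installations

set.seed(42)    # arbitrary seed
# 10 stations x 5 years x 12 months = 600 rows of made-up data
panel <- expand.grid(month = 1:12, year = 2004:2008, station = factor(1:10))
st <- as.integer(panel$station)

alt <- runif(10, 0, 1500)          # hypothetical station altitudes
station_eff <- rnorm(10, 0, 1)     # hypothetical station-specific shifts

panel$altitude <- alt[st]
panel$temp <- 18 - 0.006 * panel$altitude +
  8 * sin(2 * pi * (panel$month - 4) / 12) +   # seasonal cycle
  station_eff[st] + rnorm(nrow(panel), 0, 1)

# random intercept for each station, fixed effects for altitude and month
fit_panel <- lme(temp ~ altitude + factor(month),
                 random = ~ 1 | station, data = panel)
summary(fit_panel)
```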

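That said, if you want to step down a model to exclude weak predictor
variables one by one, use "step". Note that step() selects by AIC, not by
p-value: going backward, it drops a single-coefficient term roughly when its
absolute t-value is below sqrt(2), whereas adjusted R-squared improves when
you drop a term with an absolute t-value below 1.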

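# simulate data in which y depends on x1 and x3 only
x1=rnorm(100)
x2=rnorm(100)
x3=rnorm(100)
x4=rnorm(100)
e=rnorm(100,0,2)

y=x1+x3+e

# full model with all four candidate predictors
reg=lm(y~x1+x2+x3+x4)
summary(reg)

# backward selection starting from the full model
step(reg)

# reduced model with the two true predictors, for comparison
reg2=lm(y~x1+x3)
summary(reg2)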
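
If you really want the elimination to be driven by p-values rather than AIC,
a rough sketch is below. "backward_p" and its "alpha" argument are names I
made up, not functions from any package; the sketch relies on drop1() with
test = "F" and removes the least significant term until all remaining terms
are at or below alpha.

```r
# hypothetical helper: backward elimination by p-value (not a built-in)
backward_p <- function(fit, alpha = 0.05) {
  repeat {
    # stop if only the intercept is left
    if (length(attr(terms(fit), "term.labels")) == 0) break
    tab <- drop1(fit, test = "F")
    pvals <- tab[["Pr(>F)"]][-1]      # first row is "<none>"
    cand <- rownames(tab)[-1]
    if (max(pvals) <= alpha) break    # every remaining term is significant
    worst <- cand[which.max(pvals)]   # least significant term
    fit <- update(fit, as.formula(paste(". ~ . -", worst)))
  }
  fit
}

set.seed(123)  # arbitrary seed
dat <- data.frame(x1 = rnorm(100), x2 = rnorm(100),
                  x3 = rnorm(100), x4 = rnorm(100))
dat$y <- dat$x1 + dat$x3 + rnorm(100, 0, 2)

final <- backward_p(lm(y ~ x1 + x2 + x3 + x4, data = dat))
summary(final)
```

Keep in mind that p-values reported after this kind of selection are too
optimistic, and with as few observations as in your example the procedure is
not trustworthy at all.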

HTH
Daniel


-------------------------
cuncta stricte discussurus
-------------------------

-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On
Behalf Of Lucas Sevilla García
Sent: Thursday, September 24, 2009 8:49 AM
To: r-help at r-project.org
Subject: [R] P-value and R-squared variable selection criteria


Hi R community

I have a question. I'll explain my situation. I have to build a climate
model to obtain monthly and annual temperature from 2004 to 2008 for a
specific area in Almeria (Spain). To build this climate model, I will use
multiple regression. My dependent variable will be monthly and annual
temperature and the independent variables will be Latitude, Longitude and
Altitude, and I will work with climate data from 10 climate stations
distributed in my area of interest. I have to fit the climate model to
the data to get the temperature for each month. And I need to use the p-value
and adjusted R-squared from the model to obtain the best fit. Here is an
example. My climate data will be:

  V1 V2 V3 V4  V5
1  1 18  3  6 187
2  2 21  6  8  68
3  3 23  9  5  42
4  4 19  8  2 194
5  5 17  3  2 225

(V1 - climate station, V2 - temperature, V3 - Latitude, V4 - Longitude, V5 -
Altitude)

I fit the model to the data

 fit<-lm(V2~V3+V4+V5, data=clima)
 summary(fit)

And I get 

Call:
lm(formula = V2 ~ V3 + V4 + V5, data = clima)

Residuals:
       1        2        3        4        5 
 0.24684 -0.25200  0.17487 -0.05865 -0.11107 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)  
(Intercept) 22.103408   2.526638   8.748   0.0725 .
V3           0.236477   0.152067   1.555   0.3638  
V4          -0.073973   0.169716  -0.436   0.7383  
V5          -0.024684   0.006951  -3.551   0.1748  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Residual standard error: 0.4133 on 1 degrees of freedom
Multiple R-squared: 0.9926,     Adjusted R-squared: 0.9706 
F-statistic: 44.95 on 3 and 1 DF,  p-value: 0.1091 

The p-value for this model is 0.1091.

However, I see that variable V3 has a high p-value, so if I take it
out, my model will have a better p-value. So:

fit2<-lm(V2~V4+V5, data=clima)
summary(fit2)

Call:
lm(formula = V2 ~ V4 + V5, data = clima)

Residuals:
       1        2        3        4        5 
 0.28356 -0.21880  0.05952  0.40918 -0.53346 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)   
(Intercept) 25.764478   1.199212  21.485  0.00216 **
V4          -0.278286   0.140452  -1.981  0.18606   
V5          -0.034109   0.004451  -7.664  0.01660 * 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Residual standard error: 0.5403 on 2 degrees of freedom
Multiple R-squared: 0.9748,     Adjusted R-squared: 0.9497 
F-statistic: 38.74 on 2 and 2 DF,  p-value: 0.02516 

My new p-value for the model is lower, and better. So, this is what I have
to do: I have to import climate data and build the climate model using
those independent variables that give me the best p-value for the model, and
I have to do it automatically (in this example I did it manually). So, my
question after all this long explanation: is there a package or command I can
download to select independent variables using the p-value and adjusted
R-squared as criteria, or do I have to build what I need by myself? I guess
I can build it myself, but it will take me a while, so I would like to know
if there is some package that would help me do it faster.
Well, thanks in advance.

Lucas
 		 	   		  