[R] Calculation of extremely low p-values (in lm)
Rui Barradas
ruipbarradas at sapo.pt
Mon Dec 3 16:01:45 CET 2012
Hello,
It's easy to see what's going on by reading the sources, to be open
source is one of the strong points of R, we know exactly how the values
are computed. A reviewer might like to have an explanation of what R does.
The op could check with Friedrich Leisch's "Creating R Packages: A
Tutorial", it's running example on S3 classes is precisely the linear
model. The relevant functions and the way to call them are as follows.
Note that the p-values are computed using the distribution function,
pt(), that gives the area under the density, and that the returned
values are multiplied by two, since the test is two-sided. I've edited
the code a bit, to give an other way of computing the p-values. The
results are the same as the results of R's summary.lm() in package stats
and the code is easy to follow.
linmodEst <- function(x, y){
## compute QR-decomposition of x
qx <- qr(x)
## compute (x'x)^(-1) x'y
coef <- solve.qr(qx, y)
## degrees of freedom and standard deviation of residuals
df <- nrow(x) - ncol(x)
sigma2 <- sum((y - x %*% coef)^2)/df
## compute sigma^2 * (x'x)^-1
vcov <- sigma2 * chol2inv(qx$qr)
colnames(vcov) <- rownames(vcov) <- colnames(x)
list(coefficients = coef,
vcov = vcov,
sigma = sqrt(sigma2),
df = df)
}
summary.linmod <- function(object, ...){
se <- sqrt(diag(object$vcov))
tval <- coef(object) / se
TAB <- cbind(Estimate = coef(object),
StdErr = se,
t.value = tval,
p.value = 2*pt(-abs(tval), df=object$df),
p.value2 = 2*pt(abs(tval), df=object$df, lower.tail = FALSE))
res <- list(call=object$call,
coefficients=TAB)
#class(res) <- "summary.linmod"
res
}
mod <- linmodEst(cbind(Const = 1, x_var), y_var)
summary.linmod(mod)
Hope this helps,
Rui Barradas
Em 03-12-2012 14:26, Robert Baer escreveu:
> On 12/3/2012 6:20 AM, Sindri wrote:
>> Dear R-users
>>
>> Please excuse me if this topic has been covered before, but I was
>> unable to
>> find anything relevant by searching
>>
>> I am currently doing a comparison of two biological variables that
>> have a
>> highly significant linear relationship. I know that the p-value of
>> linear
>> regression is not so interesting in itself, but this particular value
>> does
>> raise a question.
>>
>> How does R calculate (extremely low) p-values for linear regression?
>>
>> For my data I got a p-value on the order of 10^-9 and a reviewer
>> commented
>> on this. I tried to run the same analysis in both SAS and Sigmastat
>> to be
>> sure that I was doing it right, but both these programs only return a
>> p-value of p < 0.0001
>> Since I am unable to reproduce my results in another statistics
>> program, it
>> would be nice to be able to explain this unusally low p-value to the
>> reviewers.
> This is a matter of you understanding that the p-value is an area
> under a probability density curve. R is simply printing out the
> actual area in a tail of some distribution. The other statistical
> program is making the assumption that you are using the p-value to
> compare to a cutoff alpha value that is (in most fields) never set
> much below p<0.001. If p < alpha the "hypothesis test crowd" , would
> choose to reject NULL hypothesis, so the other statistics programs
> take the attitude -- "why provide more detail?". R chooses to give
> you the actual number and let you do what you will with it. You could
> probably benefit from reviewing hypothesis testing in a basic
> statistics book if this is not clear.
>
> Note that 10e-9 is indeed less than 0.0001, so the programs don't
> disagree. R just provides more detail.
>
>>
>> This "problem" can be illustrated with the following made-up data:
>>
>> x_var<-c(0.149,0.178,0.3474,0.167,0.121,0.182,0.176,0.448,0.091,0.083,0.090,0.407,0.378,0.132,0.227,0.172,0.088,0.392,0.425,0.150,0.319,0.190,0.171,0.290,0.214,0.431,0.193)
>>
>>
>> y_var<-c(0.918,0.394,0.131,0.9084,0.916,0.934,0.928,0.279,0.830,0.927,0.964,0.323,0.097,0.914,0.614,0.790,0.984,0.530,0.207,0.858,0.408,0.919,0.869,0.347,0.834,0.276,0.940)
>>
>>
>> fit<-lm(y_var~x_var)
>>
>>> summary(fit)
>> Call:
>> lm(formula = y_var ~ x_var)
>>
>> Residuals:
>> Min 1Q Median 3Q Max
>> -0.39152 -0.06027 0.00933 0.10024 0.22711
>>
>> Coefficients:
>> Estimate Std. Error t value Pr(>|t|)
>> (Intercept) 1.18696 0.06394 18.562 3.90e-16 ***
>> x_var -2.25529 0.24788 -9.098 2.08e-09 ***
>> ---
>> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>>
>> Residual standard error: 0.1503 on 25 degrees of freedom
>> Multiple R-squared: 0.768, Adjusted R-squared: 0.7588
>> F-statistic: 82.78 on 1 and 25 DF, p-value: 2.083e-09
>>
>>
>> With kind regards,
>> Sindri Traustason
>>
>>
>>
>> -----
>> -----------------------------------------
>> Sindri Traustason
>> Glostrup Hospital Ophthalmology Research Dept.
>> Copenhagen, Demark
>>
>> --
>> View this message in context:
>> http://r.789695.n4.nabble.com/Calculation-of-extremely-low-p-values-in-lm-tp4651823.html
>> Sent from the R help mailing list archive at Nabble.com.
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>
>
More information about the R-help
mailing list