[R] polr in MASS
array chip
arrayprofile at yahoo.com
Tue May 6 18:27:57 CEST 2003
Hi,
It seems my last post had formatting problem, so
hopefully this time it works:
I am trying to use the proportional-odds model
(cumulative logistic regression) using the "polr"
function in the MASS library of R.
I tried with the dataset of "housing" contained in the
MASS book where "Sat" (3 levels: low, medium, high) is
the dependent variable, "Infl" (low, medium, high),
"Type" (tower, apartment, atrium, terrace) and "Cont"
(low, high) are the predictor variables. And I have
some questions; hope someone could help me out. The
following commands are taken from the MASS book as
well. If you know R, it might be better to understand
my problem, if not, I hope it's still ok - my problem
is actually a statistically problem, not a programming
problem, ignore the codes, but just look at the
output.
For the dataset, the proportional odds model tried to
run cumulative logistic regression of the dependent
variable "Sat" on the 3 predictor variables "Infl",
"Type", and "Cont" without interactions. In R, the
lowest level of each variable is treated as the
reference level.
> house.plr<-polr(Sat~Infl+Type+Cont,data=housing,
weights=Freq)
> summary(house.plr)
Call: polr(formula = Sat ~ Infl + Type + Cont, data =
housing, weights = Freq)
Coefficients:
Value Std. Error t value
InflMedium 0.5663922 0.10465276 5.412109
InflHigh 1.2888137 0.12715609 10.135682
TypeApartment -0.5723552 0.11923800 -4.800107
TypeAtrium -0.3661907 0.15517331 -2.359882
TypeTerrace -1.0910073 0.15148595 -7.202036
ContHigh 0.3602803 0.09553577 3.771156
Intercepts:
Value Std. Error t value
Low|Medium -0.4961 0.1248 -3.9740
Medium|High 0.6907 0.1255 5.5049
Residual Deviance: 3479.149
AIC: 3495.149
I also tried to predict the probabilities of the 3
categories of the dependent variable "Sat" for every
combination of the 3 predictor variables using the
predict function in R:
> hnames<-lapply(housing[,-5],levels)
>
house.pr1<-predict(house.plr,expand.grid(hnames[-1]),type="probs")
> cbind(expand.grid(hnames[-1]),round(house.pr1,2))
Infl Type Cont Low Medium High
1 Low Tower Low 0.38 0.29 0.33
2 Medium Tower Low 0.26 0.27 0.47
3 High Tower Low 0.14 0.21 0.65
4 Low Apartment Low 0.52 0.26 0.22
5 Medium Apartment Low 0.38 0.29 0.33
6 High Apartment Low 0.23 0.26 0.51
7 Low Atrium Low 0.47 0.27 0.26
8 Medium Atrium Low 0.33 0.29 0.38
9 High Atrium Low 0.19 0.25 0.56
10 Low Terrace Low 0.64 0.21 0.14
11 Medium Terrace Low 0.51 0.26 0.23
12 High Terrace Low 0.33 0.29 0.38
13 Low Tower High 0.30 0.28 0.42
14 Medium Tower High 0.19 0.25 0.56
15 High Tower High 0.10 0.17 0.72
16 Low Apartment High 0.43 0.28 0.29
17 Medium Apartment High 0.30 0.28 0.42
18 High Apartment High 0.17 0.23 0.60
19 Low Atrium High 0.38 0.29 0.33
20 Medium Atrium High 0.26 0.27 0.47
21 High Atrium High 0.14 0.21 0.64
22 Low Terrace High 0.56 0.25 0.19
23 Medium Terrace High 0.42 0.28 0.30
24 High Terrace High 0.26 0.27 0.47
What I am confused is that when I tried to reproduce
these predicted probabilities manually using the model
coefficients, I can sometimes get different results:
According to cumulative logistic regression, the model
is:
logit(P(Y<=k | x1,x2,x3) = a + b1*x1 + b2*x2 + b3*x3
for k=1,2,3
For example, for low Infl, Type tower, Cont low (all
on reference level),
logit(P(Sat=low))=P(Sat=low)/(1-P(Sat=low))=-0.4961,
solve for
P(Sat=low)=exp(-0.4961)/(1+exp(-0.4961))=0.38;
and
logit(P(Sat=low, medium)) =
P(Sat=low,medium)/(1-P(Sat=low,medium)) = 0.6907,
solve for P(Sat=low,medium) =
exp(0.6907)/(1+exp(0.6907))=0.67, which is the sum of
0.38 plus 0.29.
The above 2 examples showed that I can reproduce the
predicted probabilities when using the intercept
alone. However, for other combinations of the
predictor variables, I can NOT reproduce the results.
For example, for medium Infl, Type tower, cont low,
logit(P(Sat=low))=P(Sat=low)/(1-P(Sat=low))=-0.4961+0.56639=0.07029,
solve for
P(Sat=low)=exp(0.07029)/(1+exp(0.07029))=0.52; but the
probability using "predict" function is 0.26.
and
logit(P(Sat=low, medium)) =
P(Sat=low,medium)/(1-P(Sat=low,medium)) =
0.6907+0.56639=1.25709, solve for P(Sat=low,medium) =
exp(1.25709)/(1+exp(1.25709))=0.78, which certainly is
NOT the sum of 0.26 plus 0.27, and is not the sum of
0.52 and 0.27.
Did I misunderstand something here? Or the way I used
to reproduce the predicted probabilities is not
correct?
Thanks very much!
More information about the R-help
mailing list