[R] Formula for whether hat value is influential?
John Fox
jfox at mcmaster.ca
Sun Mar 9 14:35:52 CET 2008
Dear Gavin and Paul,
(k + 1)/n is the average hatvalue. The 2(k + 1)/n rule comes from results in
Belsley, Kuh, and Welsch (1980), Regression Diagnostics, concerning the
distribution of the hatvalues when n is large relative to k + 1, and when X
is multivariate normal. For smaller n, this tends to nominate too many
points, and thus suggests the rule 3(k + 1)/n, which I think is also due to
Belsley et al.
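
As a rough illustration of the two rules, here is a minimal sketch in R. The model and the built-in mtcars data are arbitrary choices made only for the example, and the object names (fit, h, p, cut2, cut3) are mine. The last line compares the 3(k + 1)/n rule with the "hat" column of influence.measures(); I am assuming the version of R in use applies the 3k/n cutoff described further down in this thread.

```r
# Illustrative model only; any lm() fit would do.
fit <- lm(mpg ~ wt + hp, data = mtcars)

h <- hatvalues(fit)        # hat values h_ii (diagonal of the hat matrix)
n <- length(h)             # number of observations
p <- length(coef(fit))     # k + 1: number of coefficients, intercept included

avg_h <- p / n             # average hat value, (k + 1)/n
cut2  <- 2 * p / n         # the 2(k + 1)/n rule
cut3  <- 3 * p / n         # the 3(k + 1)/n rule

# The hat values sum to k + 1, so their mean equals (k + 1)/n.
print(all.equal(mean(h), avg_h))

# Observations nominated by each rule.
print(names(h)[h > cut2])
print(names(h)[h > cut3])

# Compare with the flag reported by influence.measures() (assumed to use 3k/n,
# where k in the R source counts the intercept).
print(identical(unname(influence.measures(fit)$is.inf[, "hat"]), unname(h > cut3)))
```

Any point flagged by the 3(k + 1)/n rule is necessarily flagged by the 2(k + 1)/n rule as well, which is why the two-times rule can nominate more points than influence.measures() does.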
I'd prefer to call such hatvalues "noteworthy" rather than "influential,"
since hatvalues measure "leverage" on the least-squares fit and not
influence (on the coefficients).
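
The distinction can be seen in a small simulated example. This is only a sketch: the data, the seed, and the size of the shift are invented for illustration. The same high-leverage point is placed first on the underlying line and then well off it. The hat values are identical in both fits, because they depend only on x, while Cook's distance (one common measure of influence) changes sharply.

```r
set.seed(1)                          # arbitrary seed, for reproducibility only
x <- c(rnorm(20), 10)                # the 21st x value lies far from the others

# Response with the 21st point placed exactly on the underlying line 1 + 2x.
y_on <- 1 + 2 * x + rnorm(21, sd = 0.5)
y_on[21] <- 1 + 2 * x[21]

# Same data, but with the 21st response shifted well off the line.
y_off <- y_on
y_off[21] <- y_on[21] - 15

fit_on  <- lm(y_on ~ x)
fit_off <- lm(y_off ~ x)

# Leverage is a function of x alone, so it is the same in both fits.
print(all.equal(hatvalues(fit_on), hatvalues(fit_off)))
print(hatvalues(fit_on)[21])

# Influence differs: small when the point agrees with the rest of the data,
# large when it does not.
print(cooks.distance(fit_on)[21])
print(cooks.distance(fit_off)[21])
```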
Finally, I think that it's a better idea to examine diagnostics like
hatvalues graphically rather than paying too much attention to numerical
cutoffs.
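
One simple way to do this is an index plot of the hat values with reference lines at two and three times the average. The following is a sketch using base graphics only; the model and the mtcars data are again arbitrary, and the labelling threshold is just a convenience for the plot.

```r
# Illustrative model only; any lm() fit would do.
fit <- lm(mpg ~ wt + hp, data = mtcars)

h     <- hatvalues(fit)
avg_h <- length(coef(fit)) / length(h)   # average hat value, (k + 1)/n

# Index plot: one vertical spike per observation.
plot(h, type = "h",
     xlab = "Observation index", ylab = "Hat value",
     main = "Index plot of hat values")

# Reference lines at 2 and 3 times the average hat value.
abline(h = c(2, 3) * avg_h, lty = c(2, 3))

# Label the points above the lower reference line, if there are any.
big <- which(h > 2 * avg_h)
if (length(big) > 0) {
  text(big, h[big], labels = names(h)[big], pos = 3, cex = 0.7)
}
```

The reference lines are there as a visual guide; the point of the plot is to see which observations stand apart from the rest, not to apply the cutoffs mechanically.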
Regards,
John
--------------------------------
John Fox, Professor
Department of Sociology
McMaster University
Hamilton, Ontario, Canada L8S 4M4
905-525-9140x23604
http://socserv.mcmaster.ca/jfox
> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org]
> On Behalf Of Gavin Simpson
> Sent: March-09-08 5:54 AM
> To: Paul Lynch
> Cc: r-help at r-project.org
> Subject: Re: [R] Formula for whether hat value is influential?
>
> On Sat, 2008-03-08 at 19:38 -0800, Paul Lynch wrote:
> > I was wondering if someone might be able to tell me what formula R's
> > influence.measures function uses for determining whether the hat value
> > it computes is influential (i.e., the true/false value in the "hat"
> > column of the returned is.inf data frame). The reason I'm asking is
> > that its results disagree with what I've just learned in my statistics
> > class, namely that a point should be considered influential if h_ii >
> > 2(k+1)/n, where k+1 is the number of parameters in the model and n is
> > the number of data points. My 2(k+1)/n value would mark at least one
> > more point influential than influence.measures does for the data set
> > I'm looking at.
>
> Because R is open source, you have access to all of the source code:
> type influence.measures (without the parentheses) at the prompt to see
> a version without any comments.
>
> In the in-line function is.influential(), you'll find the critical
> levels used. The hat values are in infmat[, k + 4], which is the last
> column (where k is the number of terms in the model, inc. the intercept
> if present). The relevant part of is.influential is:
>
> infmat[, k + 4] > (3 * k)/n
>
> So R is using (3*(k+1)) / n in your notation (in the R code k is the
> number of terms in the model, *including* the intercept if present in
> the model).
>
> The function was originally in John Fox's car package, which is the
> support software for his book Companion to Applied Regression. In that
> book, IIRC, Fox uses two cut-offs for hat values, 2 or 3 times the
> average hat value, as indicating influential observations. R is using
> the upper level here. I would check out some of the references cited in
> the References section of ?influence.measures to see why this has been
> chosen.
>
> HTH
>
> G
>
> >
> > I am using R 2.4.1 under Windows. (Upgrading is difficult due to
> > rather severe security policies.)
> >
> > Thanks,
> >
> > --Paul
> >
> > ______________________________________________
> > R-help at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> --
> %~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
> Dr. Gavin Simpson [t] +44 (0)20 7679 0522
> ECRC, UCL Geography, [f] +44 (0)20 7679 0565
> Pearson Building, [e] gavin.simpsonATNOSPAMucl.ac.uk
> Gower Street, London [w] http://www.ucl.ac.uk/~ucfagls/
> UK. WC1E 6BT. [w] http://www.freshwaters.org.uk
> %~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%