[R] lm() and dffits

Sun Aug 31 22:07:41 CEST 2008

Ranney, Steven <steven.ranney <at> montana.edu> writes:

> 1) fit a simple lm(LW~LL)
> 2) calculate the dffits for those data points
> 3) remove those data points that are 2*sqrt(p/n) (where p=the number of 
> parameters and n=number of data points; p=3 in a linear model, correct?  
> Intercept, slope, and error term?)
> 4) rerun the model MINUS those data points
> 5) compare the two lm()
> 
> Now, each of these steps I can do separately, but only by outputting the 
> dffits to a .csv then removing the large dffits by hand, reading the .csv 
> back into R, rerunning the lm(), and comparing the first lm() to the second 
> lm().  I would imagine that there is a better (easier, I hope!) way to doing 
> all of this.  Any ideas?  
> 

You could do the following:

# --------------------
x = rnorm(100)
y = rnorm(100)
y[40] = y[40] + 30 # generate an outlier
df = data.frame(x=x,y=y)
lmfit1 = lm(y~x, data=df) # fit all data
thresh = 3 # Choose any data-dependent threshold
nice = abs(dffits(lmfit1)) < thresh # TRUE for points to keep
# note that nice[40] should be the only FALSE; check with which(!nice)
df2 = df[nice,]
lmfit2 = lm(y~x, data=df2)

summary(lmfit1)
summary(lmfit2)
# --------------------
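
The same idea with the cutoff from the question, 2*sqrt(p/n), might look like
the sketch below. It assumes p is the number of estimated coefficients, which
is 2 for lm(LW~LL) (intercept and slope; the error term is not counted). The
seed and the names fit1, fit2, cutoff and keep are only illustrative. Note that
this cutoff is much stricter than thresh = 3 above, so it can also drop some
ordinary points.

```r
set.seed(1)                          # arbitrary seed, only for reproducibility
n  <- 100
df <- data.frame(x = rnorm(n), y = rnorm(n))
df$y[40] <- df$y[40] + 30            # plant one gross outlier

fit1   <- lm(y ~ x, data = df)       # fit all data
p      <- length(coef(fit1))         # number of coefficients: 2 here
cutoff <- 2 * sqrt(p / nrow(df))     # size-adjusted DFFITS cutoff
keep   <- abs(dffits(fit1)) <= cutoff

fit2 <- lm(y ~ x, data = df[keep, ]) # refit without the flagged points

which(!keep)                         # row numbers that were dropped
rbind(all = coef(fit1), trimmed = coef(fit2))  # coefficients side by side
```

Everything stays inside the R session, so no round trip through a .csv file is
needed.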

However, this is a bit Denver-Style Home-Brewery. Instead of using this 
ad-hoc method, you are probably better off using one of the robust regression
methods, for example rlm() in the MASS package.
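
A minimal sketch of the robust route, assuming rlm() from MASS with its default
settings (a Huber M-estimator) is acceptable for your data; the simulated data
and the names ols and rfit are only illustrative:

```r
library(MASS)                        # provides rlm()

set.seed(1)                          # arbitrary seed, only for reproducibility
df <- data.frame(x = rnorm(100), y = rnorm(100))
df$y[40] <- df$y[40] + 30            # plant one gross outlier

ols  <- lm(y ~ x, data = df)         # ordinary least squares, for comparison
rfit <- rlm(y ~ x, data = df)        # robust fit, no points removed by hand

rbind(lm = coef(ols), rlm = coef(rfit))  # coefficients side by side
which.min(rfit$w)                    # observation given the smallest weight
```

Here no observation is deleted; the fit down-weights points with large
residuals, and the final weights in rfit$w show which ones were affected.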

Dieter