[R] outliers/interval data extraction

Jason Turner jasont at indigoindustrial.co.nz
Thu Feb 20 19:10:03 CET 2003

On Thu, Feb 20, 2003 at 06:37:48PM -0500, Rado Bonk wrote:
> Dear R-users,
> I have two outliers related questions.
> I.
> I have a vector consisting of 69 values.
> mean = 0.00086
> SD = 0.02152
> The shape of EDA graphics (boxplots, density plots) is heavily distorted
> due to outliers. How to define the interval for outliers exception? Is
> <2SD - mean + 2SD> interval a correct approach?


There's been a lot of discussion of this over the years; these
discussions usually generate more heat than light.

<personal bias>
Throwing away outliers without further investigation is often
considered a bad idea.  The argument is that you get into a situation
where you are rejecting data because it doesn't fit the model, which
is a strange approach.  The most famous case of this was satellite
data on ozone thickness over Antarctica - the ozone hole was missed
for years because of an automatic outlier-rejection routine in the
data analysis.  If those outliers hadn't been rejected, corrective
action could have been taken sooner, avoiding a lot of damage.

My own work is in industrial process control - if I ignored outliers,
I'd make an awful lot of very bad mistakes, and wouldn't have a job
for long. 

Outliers aren't necessarily wrong - sometimes the data is trying to
tell you something.
</personal bias>

Robust summaries are another way to describe the data without letting
a few extreme values dominate.  Check out the help pages for mad(),
IQR(), and fivenum().
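
As a rough illustration of why the robust summaries help, here is a
minimal sketch on made-up data.  The vector x is purely hypothetical
(67 well-behaved values plus two gross outliers), not the poster's data;
only mean(), sd(), median(), mad(), IQR() and fivenum() from base R are
used.

```r
# Hypothetical data: 67 well-behaved values plus two gross outliers
set.seed(1)
x <- c(rnorm(67, mean = 0, sd = 0.01), 0.12, -0.10)

# Classical summaries: sensitive to the two extreme values
mean(x)
sd(x)

# Robust counterparts: barely affected by a few extreme values
median(x)
mad(x)      # median absolute deviation, scaled to be comparable with sd()
IQR(x)      # interquartile range
fivenum(x)  # Tukey's five numbers: min, lower hinge, median, upper hinge, max
```

Comparing sd(x) with mad(x) gives a quick feel for how much the extreme
values are inflating the classical spread estimate.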

Having said that, if you want to compare outlier-free data with your
raw data to help enlighten you about where those outliers might be
coming from, something like this might help...

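# myvec is your numeric vector
ss <- mad(myvec)      # robust scale estimate (comparable to SD for normal data)
mm <- median(myvec)   # robust centre
# TRUE for values within 3 robust "SDs" of the median
ind <- (myvec > mm - 3*ss & myvec < mm + 3*ss)
# or: TRUE for values strictly between the 2.5% and 97.5% quantiles
# (note this always drops roughly 5% of the data, outliers or not)
ind2 <- (myvec > quantile(myvec, 0.025) & myvec < quantile(myvec, 0.975))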
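
One way the index might then be used to compare the raw and filtered
data is sketched below.  The myvec built here is a hypothetical stand-in
for the real 69 values, and the names kept and flagged are mine, chosen
for illustration only.

```r
# Hypothetical stand-in for the 69-value vector
set.seed(1)
myvec <- c(rnorm(67, mean = 0, sd = 0.01), 0.12, -0.10)

ss <- mad(myvec)
mm <- median(myvec)
ind <- (myvec > mm - 3*ss & myvec < mm + 3*ss)

flagged <- myvec[!ind]   # look at these individually before discarding anything
kept    <- myvec[ind]

which(!ind)              # positions of the flagged values in the original vector
summary(myvec)
summary(kept)

# Side-by-side boxplots of the raw and filtered data
boxplot(list(raw = myvec, filtered = kept))
```

Keeping the flagged values in their own vector, rather than just
dropping them, makes it easier to go back and ask where they came from.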



Indigo Industrial Controls Ltd.
jasont at indigoindustrial.co.nz
