[R] grubbs test to detect all outliers
Rui Barradas
ru|pb@rr@d@@ @end|ng |rom @@po@pt
Sat Apr 29 15:18:13 CEST 2023
At 14:01 on 29/04/2023, AbouEl-Makarim Aboueissa wrote:
> Hi Rui:
>
>
> How about this dataset, please see below. I included a few outliers in each
> column, as you can see in the printed dataset; please see below.
>
>
> Once again, thank you very much, and sorry if I bothered you all.
>
> abou
>
>
>
>> dput(datafortest)
> structure(list(factor1 = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
> 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
> 3L, 3L, NA, NA, NA, NA), levels = c("1", "2", "3"), class = "factor"),
> X = c(994455.077, 4348.031, 9999.789, 3813.139, 12.65, 5642.667,
> 876684.386, 5165.731, NA, 3259.241, 8.383, 1997.878, 99990.608,
> 2655.977, 9.49, 1826.851, 4386.002, 883295.091, 2120.902,
> NA, 2056.123, 5.088, NA, 92539.873, NA, NA, NA, NA), Y = c(76888L,
> 333L, 618L, 10L, 344L, NA, 3L, 86999L, 265L, 557L, 77777L,
> 383L, NA, NA, 87777L, 287L, 352L, 308L, 999526L, 489L, 2L,
> 444L, 9L, 333L, NA, NA, NA, NA), factor2 = structure(c(1L,
> 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
> 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), levels = c("1",
> "2", "3"), class = "factor"), Z = c(54999L, 475L, 15L, 603L,
> 442L, 79486L, 927L, 971L, 388L, 888L, 514L, 409L, 546L, 523L,
> 313L, 296L, 320L, 388L, 79999L, 677L, 555L, NA, 479L, 257L,
> 313L, 21L, 320L, 4L), U = c(NA, NA, 1.5, 332, 216, 217, 1000,
> 10, 9999, 444, NA, 5, 327, 58888, 456, 412, 251, 6, 398,
> 438, 428, 15, NA, 406, 334, 465, 180, 88999), V = c(12, 240,
> 9000, 265, NA, 99999, 1, 562, 13, 777, 322, NA, 99988, 653,
> 450, 576, NA, 396.5, 91888, 5, 219, NA, 321, 417, 409, 999999,
> 523, 10)), row.names = c(NA, -28L), class = "data.frame")
>>
>
>
>
>> datafortest
> factor1 X Y factor2 Z U V
> 1 1 994455.077 76888 1 54999 NA 12.0
> 2 1 4348.031 333 1 475 NA 240.0
> 3 1 9999.789 618 1 15 1.5 9000.0
> 4 1 3813.139 10 1 603 332.0 265.0
> 5 1 12.650 344 1 442 216.0 NA
> 6 1 5642.667 NA 1 79486 217.0 99999.0
> 7 1 876684.386 3 1 927 1000.0 1.0
> 8 2 5165.731 86999 1 971 10.0 562.0
> 9 2 NA 265 1 388 9999.0 13.0
> 10 2 3259.241 557 2 888 444.0 777.0
> 11 2 8.383 77777 2 514 NA 322.0
> 12 2 1997.878 383 2 409 5.0 NA
> 13 2 99990.608 NA 2 546 327.0 99988.0
> 14 2 2655.977 NA 2 523 58888.0 653.0
> 15 3 9.490 87777 2 313 456.0 450.0
> 16 3 1826.851 287 2 296 412.0 576.0
> 17 3 4386.002 352 2 320 251.0 NA
> 18 3 883295.091 308 2 388 6.0 396.5
> 19 3 2120.902 999526 3 79999 398.0 91888.0
> 20 3 NA 489 3 677 438.0 5.0
> 21 3 2056.123 2 3 555 428.0 219.0
> 22 3 5.088 444 3 NA 15.0 NA
> 23 3 NA 9 3 479 NA 321.0
> 24 3 92539.873 333 3 257 406.0 417.0
> 25 <NA> NA NA 3 313 334.0 409.0
> 26 <NA> NA NA 3 21 465.0 999999.0
> 27 <NA> NA NA 3 320 180.0 523.0
> 28 <NA> NA NA 3 4 88999.0 10.0
>>
>
>
>
> with many thanks
> abou
>
> ______________________
>
>
> *AbouEl-Makarim Aboueissa, PhD*
>
> *Professor, Mathematics and Statistics*
> *Graduate Coordinator*
>
> *Department of Mathematics and Statistics*
> *University of Southern Maine*
>
>
>
> On Sat, Apr 29, 2023 at 8:05 AM Rui Barradas <ruipbarradas using sapo.pt> wrote:
>
>> At 14:09 on 28/04/2023, AbouEl-Makarim Aboueissa wrote:
>>> *R: *Grubbs Test to detect all outliers Per group for all columns in a
>> data
>>> frame
>>>
>>>
>>>
>>> Dear All: good morning
>>>
>>> I have a dataset (as an example) with two column factors (factor1 and
>>> factor2) and 5 numerical columns (X,Y,Z,U,V). The X and Y columns have
>> same
>>> length as factor1; and Z, U, and V have same length as factor2. Please
>> see
>>> dataset is copied below. Please note that all dataset columns have NAs
>>> values.
>>>
>>> *Need help on this:*
>>>
>>>
>>> Can we use the grubbs.test() function to detect all outliers and replace
>> it
>>> by NA in X and Y datasets per group in factor1; and in Z, U, and V
>> datasets
>>> per group in factor2. Columns in the dataframe have different lengths,
>> but
>>> when I read the .csv file, R added NA values for the shorter columns.
>>>
>>> If you need the .csv data file, please let me know.
>>>
>>>
>>> Thank you very much for your help in advance.
>>>
>>>
>>>
>>>
>>> install.packages("outliers")
>>> library(outliers)
>>>
>>> datafortest<-read.csv("G:/data_for_test.csv", header=TRUE)
>>> datafortest
>>>
>>> datafortest<-data.frame(datafortest)
>>>
>>> datafortest$factor1<-as.factor(datafortest$factor1)
>>> datafortest$factor2<-as.factor(datafortest$factor2)
>>>
>>> str(datafortest)
>>>
>>> ##### tried to use grubbs.test() on a single column of the dataframe, but
>>> still not working
>>> tests.for.outliers.X<- grubbs.test(datafortest$X, na.rm = TRUE, type=11)
>>>
>>>
>>> ####################################
>>>
>>> *grubbs.test() on a single dataset: but this can only detect if the min
>> and
>>> the max are outliers.*
>>>
>>>
>>> xx999<-c(0.088,1,2,3,4,5,6,7,8,9,88,98,99)
>>> grubbs.test(xx999, type=11)
>>>
>>>
>>>
>>>
>>> With many thanks
>>>
>>> Abou
>>>
>>>
>>>
>>> factor1 X Y factor2 Z U
>>> V
>>> 1 4455.077 888 1 999 NA 999
>>> 1 4348.031 333 1 475 NA 240
>>> 1 9999.789 618 1 507 252 394
>>> 1 3813.139 417 1 603 332 265
>>> 1 7512.65 344 1 442 216 NA
>>> 1 5642.667 NA 1 486 217 275
>>> 1 6684.386 341 1 927 698 479
>>> 2 5165.731 999 1 971 311 562
>>> 2 NA 265 1 388 999 512
>>> 2 3259.241 557 2 888 444 777
>>> 2 3288.383 234 2 514 NA 322
>>> 2 1997.878 383 2 409 311 NA
>>> 2 99990.61 NA 2 546 327 728
>>> 2 2655.977 NA 2 523 228 653
>>> 3 3189.49 7777 2 313 456 450
>>> 3 1826.851 287 2 296 412 576
>>> 3 4386.002 352 2 320 251 NA
>>> 3 3295.091 308 2 388 888 396.5
>>> 3 2120.902 526 3 9999 398 888
>>> 3 NA 489 3 677 438 307
>>> 3 2056.123 291 3 555 428 219
>>> 3 1995.088 444 3 NA 319 NA
>>> 3 NA 349 3 479 NA 321
>>> 3 2539.873 333 3 257 406 417
>>> 3 313 334 409
>>> 3 296 465 546
>>> 3 320 180 523
>>> 3 388 999 313
>>>
>>>
>>>
>>> ______________________
>>>
>>>
>>> *AbouEl-Makarim Aboueissa, PhD*
>>>
>>> *Professor, Mathematics and Statistics*
>>> *Graduate Coordinator*
>>>
>>> *Department of Mathematics and Statistics*
>>> *University of Southern Maine*
>>>
>> Hello,
>>
>> With the data file you have attached I cannot reproduce any errors, all
>> went well at the first try.
>>
>>
>> library(outliers)
>>
>> fl <- "~/data_for_test.csv"
>> datafortest <- read.csv(fl)
>>
>> # these are not needed to run the test
>> datafortest$factor1 <- as.factor(datafortest$factor1)
>> datafortest$factor2 <- as.factor(datafortest$factor2)
>> str(datafortest)
>> #> 'data.frame': 28 obs. of 7 variables:
>> #> $ factor1: Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 2 2 2 ...
>> #> $ X : num 4455 4348 10000 3813 7513 ...
>> #> $ Y : int 888 333 618 417 344 NA 341 999 265 557 ...
>> #> $ factor2: Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 2 ...
>> #> $ Z : int 999 475 507 603 442 486 927 971 388 888 ...
>> #> $ U : int NA NA 252 332 216 217 698 311 999 444 ...
>> #> $ V : num 999 240 394 265 NA 275 479 562 512 777 ...
>> head(datafortest)
>> #> factor1 X Y factor2 Z U V
>> #> 1 1 4455.077 888 1 999 NA 999
>> #> 2 1 4348.031 333 1 475 NA 240
>> #> 3 1 9999.789 618 1 507 252 394
>> #> 4 1 3813.139 417 1 603 332 265
>> #> 5 1 7512.650 344 1 442 216 NA
>> #> 6 1 5642.667 NA 1 486 217 275
>>
>> ##### tried to use grubbs.test() on a single column of the dataframe, but
>> ##### still not working
>> grubbs.test(datafortest$X, type = 11)
>> #>
>> #> Grubbs test for two opposite outliers
>> #>
>> #> data: datafortest$X
>> #> G = 4.6640014, U = 0.0091756, p-value = 0.02867
>> #> alternative hypothesis: 1826.851 and 99990.608 are outliers
>>
>>
>>
>> Hope this helps,
>>
>> Rui Barradas
>>
>>
>
Hello,

With this data set, the problem seems to be what you want to consider an
outlier. Types 10 and 11 give radically different results.

From the help page, section Details:

First test (10) is used to detect if the sample dataset contains one
outlier, statistically different than the other values. Test is based by
calculating score of this outlier G (outlier minus mean and divided by
sd) and comparing it to appropriate critical values. Alternative method
is calculating ratio of variances of two datasets - full dataset and
dataset without outlier. The obtained value called U is bound with G by
simple formula.

Second test (11) is used to check if lowest and highest value are two
outliers on opposite tails of sample. It is based on calculation of
ratio of range to standard deviation of the sample.

Third test (20) calculates ratio of variance of full sample and sample
without two extreme observations. It is used to detect if dataset
contains two outliers on the same tail.
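
As a rough illustration of these three definitions (a base-R sketch, not the
package's own code), the statistics can be recomputed by hand for column X,
copied here from the dput() output quoted above. The helper name ss is
introduced only for this example, and reading "ratio of variances" as a
ratio of sums of squared deviations is an assumption that happens to agree
with the values grubbs.test() prints for this column.

```r
# Column X from the dput() output quoted above
X <- c(994455.077, 4348.031, 9999.789, 3813.139, 12.65, 5642.667,
       876684.386, 5165.731, NA, 3259.241, 8.383, 1997.878, 99990.608,
       2655.977, 9.49, 1826.851, 4386.002, 883295.091, 2120.902,
       NA, 2056.123, 5.088, NA, 92539.873, NA, NA, NA, NA)

x <- sort(X[!is.na(X)])      # drop NAs, sort ascending
n <- length(x)               # 21 non-missing values

# Hypothetical helper: sum of squared deviations from the mean
ss <- function(v) sum((v - mean(v))^2)

# Type 10: score of the most extreme value, and the variance-ratio form
far <- which.max(abs(x - mean(x)))
G10 <- abs(x[far] - mean(x)) / sd(x)
U10 <- ss(x[-far]) / ss(x)

# The "simple formula" linking U and G for one outlier
U10_from_G <- 1 - n * G10^2 / (n - 1)^2

# Type 11: range divided by the standard deviation
G11 <- (max(x) - min(x)) / sd(x)

# Type 20: sample without the two highest values versus the full sample
U20 <- ss(x[seq_len(n - 2)]) / ss(x)

# Compare with G = 2.6106, U = 0.6422 (type 10), G = 3.04754 (type 11)
# and U = 0.33892 (type 20) in the grubbs.test() output for this column
round(c(G10 = G10, U10 = U10, G11 = G11, U20 = U20), 5)
```

This only reproduces the test statistics; the p-values still come from
grubbs.test().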

The results below seem to show that there are two outliers on the right
tail. Do you have reasons to believe this is true? But that is a
statistics question; the code runs fine.

library(outliers)
datafortest$factor1 <- as.factor(datafortest$factor1)
datafortest$factor2 <- as.factor(datafortest$factor2)
grubbs.test(datafortest$X, type = 10)
#>
#> Grubbs test for one outlier
#>
#> data: datafortest$X
#> G = 2.6106, U = 0.6422, p-value = 0.04389
#> alternative hypothesis: highest value 994455.077 is an outlier
grubbs.test(datafortest$X, type = 11)
#>
#> Grubbs test for two opposite outliers
#>
#> data: datafortest$X
#> G = 3.04754, U = 0.63726, p-value = 1
#> alternative hypothesis: 5.088 and 994455.077 are outliers
grubbs.test(datafortest$X, type = 20)
#>
#> Grubbs test for two outliers
#>
#> data: datafortest$X
#> U = 0.33892, p-value < 2.2e-16
#> alternative hypothesis: highest values 883295.091 , 994455.077 are outliers
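
On the original question (flag every outlier per group and replace it with
NA), one heuristic is to apply the one-outlier test repeatedly within each
group until it stops rejecting. The following is a minimal base-R sketch of
that idea, not a feature of the outliers package. The function names
grubbs_p, na_outliers and na_outliers_by_group are hypothetical, and the
one-sided p-value uses a t-distribution approximation that appears to
reproduce the p-value 0.04389 shown above for type = 10. Repeated testing is
not what Grubbs' test was designed for, so treat the result as exploratory.

```r
# Approximate one-sided p-value for the single most extreme value.
# Assumes length(v) >= 3, no NAs and sd(v) > 0.
grubbs_p <- function(v) {
  n <- length(v)
  g <- max(abs(v - mean(v))) / sd(v)
  denom <- max((n - 1)^2 - n * g^2, 0)   # 0 when g reaches its maximum
  t <- sqrt(n * (n - 2) * g^2 / denom)
  min(1, n * pt(t, df = n - 2, lower.tail = FALSE))
}

# Set the most extreme value to NA while the test rejects at level alpha.
na_outliers <- function(x, alpha = 0.05) {
  repeat {
    ok <- which(!is.na(x))
    if (length(ok) < 3 || sd(x[ok]) == 0) break
    v <- x[ok]
    if (grubbs_p(v) >= alpha) break
    x[ok[which.max(abs(v - mean(v)))]] <- NA
  }
  x
}

# Apply na_outliers() within each level of a factor g.
# Rows where g is NA are left untouched.
na_outliers_by_group <- function(x, g, alpha = 0.05) {
  for (lev in levels(g)) {
    idx <- which(g == lev)
    x[idx] <- na_outliers(x[idx], alpha)
  }
  x
}

# Column X and factor1 as in the dput() output quoted above
X <- c(994455.077, 4348.031, 9999.789, 3813.139, 12.65, 5642.667,
       876684.386, 5165.731, NA, 3259.241, 8.383, 1997.878, 99990.608,
       2655.977, 9.49, 1826.851, 4386.002, 883295.091, 2120.902,
       NA, 2056.123, 5.088, NA, 92539.873, NA, NA, NA, NA)
factor1 <- factor(rep(c("1", "2", "3", NA), times = c(7, 7, 10, 4)))

X_clean <- na_outliers_by_group(X, factor1)
data.frame(factor1, X, X_clean)
```

With the data frame from this thread, something like
datafortest[c("X", "Y")] <- lapply(datafortest[c("X", "Y")],
na_outliers_by_group, g = datafortest$factor1) would treat both columns the
same way, and likewise Z, U and V with factor2. Note that masking can occur:
when a small group contains two similarly extreme values, as in level 1 of
factor1 for X, the inflated standard deviation can keep the test from
flagging either of them.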
Hope this helps,
Rui Barradas