[R] grubbs test to detect all outliers

AbouEl-Makarim Aboueissa
Sat Apr 29 15:01:24 CEST 2023


Hi Rui:


How about this dataset (see below)? I included a few outliers in each
column, as you can see in the printed dataset.


Once again, thank you very much, and sorry if I bothered you all.

abou



> dput(datafortest)
structure(list(factor1 = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, NA, NA, NA, NA), levels = c("1", "2", "3"), class = "factor"),
    X = c(994455.077, 4348.031, 9999.789, 3813.139, 12.65, 5642.667,
    876684.386, 5165.731, NA, 3259.241, 8.383, 1997.878, 99990.608,
    2655.977, 9.49, 1826.851, 4386.002, 883295.091, 2120.902,
    NA, 2056.123, 5.088, NA, 92539.873, NA, NA, NA, NA), Y = c(76888L,
    333L, 618L, 10L, 344L, NA, 3L, 86999L, 265L, 557L, 77777L,
    383L, NA, NA, 87777L, 287L, 352L, 308L, 999526L, 489L, 2L,
    444L, 9L, 333L, NA, NA, NA, NA), factor2 = structure(c(1L,
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
    2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), levels = c("1",
    "2", "3"), class = "factor"), Z = c(54999L, 475L, 15L, 603L,
    442L, 79486L, 927L, 971L, 388L, 888L, 514L, 409L, 546L, 523L,
    313L, 296L, 320L, 388L, 79999L, 677L, 555L, NA, 479L, 257L,
    313L, 21L, 320L, 4L), U = c(NA, NA, 1.5, 332, 216, 217, 1000,
    10, 9999, 444, NA, 5, 327, 58888, 456, 412, 251, 6, 398,
    438, 428, 15, NA, 406, 334, 465, 180, 88999), V = c(12, 240,
    9000, 265, NA, 99999, 1, 562, 13, 777, 322, NA, 99988, 653,
    450, 576, NA, 396.5, 91888, 5, 219, NA, 321, 417, 409, 999999,
    523, 10)), row.names = c(NA, -28L), class = "data.frame")
>



> datafortest
   factor1          X      Y factor2     Z       U        V
1        1 994455.077  76888       1 54999      NA     12.0
2        1   4348.031    333       1   475      NA    240.0
3        1   9999.789    618       1    15     1.5   9000.0
4        1   3813.139     10       1   603   332.0    265.0
5        1     12.650    344       1   442   216.0       NA
6        1   5642.667     NA       1 79486   217.0  99999.0
7        1 876684.386      3       1   927  1000.0      1.0
8        2   5165.731  86999       1   971    10.0    562.0
9        2         NA    265       1   388  9999.0     13.0
10       2   3259.241    557       2   888   444.0    777.0
11       2      8.383  77777       2   514      NA    322.0
12       2   1997.878    383       2   409     5.0       NA
13       2  99990.608     NA       2   546   327.0  99988.0
14       2   2655.977     NA       2   523 58888.0    653.0
15       3      9.490  87777       2   313   456.0    450.0
16       3   1826.851    287       2   296   412.0    576.0
17       3   4386.002    352       2   320   251.0       NA
18       3 883295.091    308       2   388     6.0    396.5
19       3   2120.902 999526       3 79999   398.0  91888.0
20       3         NA    489       3   677   438.0      5.0
21       3   2056.123      2       3   555   428.0    219.0
22       3      5.088    444       3    NA    15.0       NA
23       3         NA      9       3   479      NA    321.0
24       3  92539.873    333       3   257   406.0    417.0
25    <NA>         NA     NA       3   313   334.0    409.0
26    <NA>         NA     NA       3    21   465.0 999999.0
27    <NA>         NA     NA       3   320   180.0    523.0
28    <NA>         NA     NA       3     4 88999.0     10.0
>



with many thanks
abou

______________________


*AbouEl-Makarim Aboueissa, PhD*

*Professor, Mathematics and Statistics*
*Graduate Coordinator*

*Department of Mathematics and Statistics*
*University of Southern Maine*



On Sat, Apr 29, 2023 at 8:05 AM Rui Barradas <ruipbarradas using sapo.pt> wrote:

> At 14:09 on 28/04/2023, AbouEl-Makarim Aboueissa wrote:
> > *R:* Grubbs Test to detect all outliers per group for all columns in a
> > data frame
> >
> >
> >
> > Dear All: good morning
> >
> > I have a dataset (as an example) with two factor columns (factor1 and
> > factor2) and 5 numerical columns (X, Y, Z, U, V). The X and Y columns have
> > the same length as factor1; Z, U, and V have the same length as factor2.
> > The dataset is copied below. Please note that all dataset columns contain
> > NA values.
> >
> > *Need help on this:*
> >
> >
> > Can we use the grubbs.test() function to detect all outliers and replace
> > them with NA in the X and Y columns per group in factor1, and in the Z, U,
> > and V columns per group in factor2? Columns in the data frame have
> > different lengths, but when I read the .csv file, R added NA values for
> > the shorter columns.
> >
> > If you need the .csv data file, please let me know.
> >
> >
> > Thank you very much for your help in advance.
> >
> >
> >
> >
> > install.packages("outliers")
> > library(outliers)
> >
> > datafortest<-read.csv("G:/data_for_test.csv", header=TRUE)
> > datafortest
> >
> > datafortest<-data.frame(datafortest)
> >
> > datafortest$factor1<-as.factor(datafortest$factor1)
> > datafortest$factor2<-as.factor(datafortest$factor2)
> >
> > str(datafortest)
> >
> > ##### tried to use grubbs.test() on a single column of the dataframe,
> > ##### but still not working
> > tests.for.outliers.X<- grubbs.test(datafortest$X, na.rm = TRUE, type=11)
> >
> >
> > ####################################
> >
> > *grubbs.test() on a single dataset: but this can only detect whether the
> > min and the max are outliers.*
> >
> >
> > xx999<-c(0.088,1,2,3,4,5,6,7,8,9,88,98,99)
> > grubbs.test(xx999, type=11)
> >
> >
> >
> >
> > With many thanks
> >
> > Abou
> >
> >
> >
> > factor1         X     Y  factor2     Z    U      V
> >       1  4455.077   888        1   999   NA    999
> >       1  4348.031   333        1   475   NA    240
> >       1  9999.789   618        1   507  252    394
> >       1  3813.139   417        1   603  332    265
> >       1   7512.65   344        1   442  216     NA
> >       1  5642.667    NA        1   486  217    275
> >       1  6684.386   341        1   927  698    479
> >       2  5165.731   999        1   971  311    562
> >       2        NA   265        1   388  999    512
> >       2  3259.241   557        2   888  444    777
> >       2  3288.383   234        2   514   NA    322
> >       2  1997.878   383        2   409  311     NA
> >       2  99990.61    NA        2   546  327    728
> >       2  2655.977    NA        2   523  228    653
> >       3   3189.49  7777        2   313  456    450
> >       3  1826.851   287        2   296  412    576
> >       3  4386.002   352        2   320  251     NA
> >       3  3295.091   308        2   388  888  396.5
> >       3  2120.902   526        3  9999  398    888
> >       3        NA   489        3   677  438    307
> >       3  2056.123   291        3   555  428    219
> >       3  1995.088   444        3    NA  319     NA
> >       3        NA   349        3   479   NA    321
> >       3  2539.873   333        3   257  406    417
> >                                3   313  334    409
> >                                3   296  465    546
> >                                3   320  180    523
> >                                3   388  999    313
> >
> >
> >
> > ______________________
> >
> >
> > *AbouEl-Makarim Aboueissa, PhD*
> >
> > *Professor, Mathematics and Statistics*
> > *Graduate Coordinator*
> >
> > *Department of Mathematics and Statistics*
> > *University of Southern Maine*
> >
> >       [[alternative HTML version deleted]]
> >
> > ______________________________________________
> > R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> Hello,
>
> With the data file you have attached I cannot reproduce any errors, all
> went well at the first try.
>
>
> library(outliers)
>
> fl <- "~/data_for_test.csv"
> datafortest <- read.csv(fl)
>
> # these are not needed to run the test
> datafortest$factor1 <- as.factor(datafortest$factor1)
> datafortest$factor2 <- as.factor(datafortest$factor2)
> str(datafortest)
> #> 'data.frame':    28 obs. of  7 variables:
> #>  $ factor1: Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 2 2 2 ...
> #>  $ X      : num  4455 4348 10000 3813 7513 ...
> #>  $ Y      : int  888 333 618 417 344 NA 341 999 265 557 ...
> #>  $ factor2: Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 2 ...
> #>  $ Z      : int  999 475 507 603 442 486 927 971 388 888 ...
> #>  $ U      : int  NA NA 252 332 216 217 698 311 999 444 ...
> #>  $ V      : num  999 240 394 265 NA 275 479 562 512 777 ...
> head(datafortest)
> #>   factor1        X   Y factor2   Z   U   V
> #> 1       1 4455.077 888       1 999  NA 999
> #> 2       1 4348.031 333       1 475  NA 240
> #> 3       1 9999.789 618       1 507 252 394
> #> 4       1 3813.139 417       1 603 332 265
> #> 5       1 7512.650 344       1 442 216  NA
> #> 6       1 5642.667  NA       1 486 217 275
>
> ##### tried to use grubbs.test() on a single column of the dataframe, but
> ##### still not working
> grubbs.test(datafortest$X, type = 11)
> #>
> #>  Grubbs test for two opposite outliers
> #>
> #> data:  datafortest$X
> #> G = 4.6640014, U = 0.0091756, p-value = 0.02867
> #> alternative hypothesis: 1826.851 and 99990.608 are outliers
>
>
>
> Hope this helps,
>
> Rui Barradas
>
>
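
On the remaining question (all outliers per group, replaced by NA): as far as I can tell, grubbs.test() accepts only x, type, opposite and two.sided, so the na.rm = TRUE in the original call is probably what raised the error; the function drops NAs on its own. Below is a minimal sketch of one possible approach, assuming the outliers package is installed. It repeats the single-outlier test (type = 10) within each group and sets the flagged value to NA until the test is no longer significant. The helper name grubbs_to_na, the alpha = 0.05 cutoff, the min_n = 4 floor and the toy data are illustrative assumptions, not part of the package or of this thread. Repeating Grubbs' test like this is a heuristic: when a small group holds several extreme values they can mask each other, and then nothing is flagged.

```r
library(outliers)

# Hypothetical helper: apply the one-outlier Grubbs test repeatedly and
# replace each flagged value by NA. `alpha` and `min_n` are illustrative.
grubbs_to_na <- function(x, alpha = 0.05, min_n = 4) {
  repeat {
    ok <- which(!is.na(x))
    # Stop when too few values remain or the remaining values are constant
    if (length(ok) < min_n || sd(x[ok]) == 0) break
    p <- grubbs.test(x[ok], type = 10)$p.value
    if (is.na(p) || p >= alpha) break
    # type = 10 tests the extreme value farthest from the mean
    worst <- ok[which.max(abs(x[ok] - mean(x[ok])))]
    x[worst] <- NA
  }
  x
}

# Toy data (made up for illustration): one gross outlier in group "a"
toy <- data.frame(
  g = factor(rep(c("a", "b"), each = 8)),
  x = c(10, 11, 9, 10, 12, 11, 10, 500,
        5, 6, 5, 7, 6, 5, 6, NA)
)

# ave() applies the helper within each level of g and keeps the row order
cleaned <- toy
cleaned$x <- ave(toy$x, toy$g, FUN = grubbs_to_na)
cleaned

# For the data frame in this thread the same idea would be (not run here):
# for (v in c("X", "Y"))
#   datafortest[[v]] <- ave(datafortest[[v]], datafortest$factor1,
#                           FUN = grubbs_to_na)
# for (v in c("Z", "U", "V"))
#   datafortest[[v]] <- ave(datafortest[[v]], datafortest$factor2,
#                           FUN = grubbs_to_na)
```

Rows whose grouping factor is NA are left untouched by ave(), which matters for factor1 in the posted data.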
