[R] how to find number of unique rows for combination of r columns
Rui Barradas
ru|pb@rr@d@@ @end|ng |rom @@po@pt
Sat Nov 9 00:02:17 CET 2019
Hello,
If performance is important, and with 73M rows it probably is, take a
look at this StackOverflow post.
[1] https://stackoverflow.com/a/36058634/8245406
Hope this helps,
Rui Barradas
Às 21:33 de 08/11/19, Martin Morgan escreveu:
> With this example
>
>> df = data.frame(a = c(1, 1, 2, 2), b = c(1, 1, 2, 3), value = 1:4)
>> df
> a b value
> 1 1 1 1
> 2 1 1 2
> 3 2 2 3
> 4 2 3 4
>
> The approach to drop duplicates in the first and second columns has as a consequence the arbitrary choice of 'value' for the duplicate entries -- why chose a value of '1' rather than '2' (or the average of 1 and 2, or a list containing all possible values, or...) for the rows duplicated in columns a and b?
>
>> df[!duplicated(df[,1:2]),]
> a b value
> 1 1 1 1
> 3 2 2 3
> 4 2 3 4
>
> In base R one might
>
>> aggregate(value ~ a + b, df, mean)
> a b value
> 1 1 1 1.5
> 2 2 2 3.0
> 3 2 3 4.0
>> aggregate(value ~ a + b, df, list)
> a b value
> 1 1 1 1, 2
> 2 2 2 3
> 3 2 3 4
>
> but handling several value-like columns would be hard(?)
>
> Using library(dplyr), I have
>
>> group_by(df, a, b) %>% summarize(mean_value = mean(value))
> # A tibble: 3 x 3
> # Groups: a [2]
> a b mean_value
> <dbl> <dbl> <dbl>
> 1 1 1 1.5
> 2 2 2 3
> 3 2 3 4
>
> or
>
>> group_by(df, a, b) %>% summarize(values = list(value))
> # A tibble: 3 x 3
> # Groups: a [2]
> a b values
> <dbl> <dbl> <list>
> 1 1 1 <int [2]>
> 2 2 2 <int [1]>
> 3 2 3 <int [1]>
>
> summarizing multiple columns with dplyr
>
>> df$v1 = 1:4
>> df$v2 = 4:1
>> group_by(df, a, b) %>% summarize(v1_mean = mean(v1), v2_median = median(v2))
> # A tibble: 3 x 4
> # Groups: a [2]
> a b v1_mean v2_median
> <dbl> <dbl> <dbl> <dbl>
> 1 1 1 1.5 3.5
> 2 2 2 3 2
> 3 2 3 4 1
>
> I do not know how performant this would be with data of your size.
>
> Martin Morgan
>
> On 11/8/19, 1:39 PM, "R-help on behalf of Ana Marija" <r-help-bounces using r-project.org on behalf of sokovic.anamarija using gmail.com> wrote:
>
> Thank you so much!!!
>
> On Fri, Nov 8, 2019 at 11:40 AM Bert Gunter <bgunter.4567 using gmail.com> wrote:
> >
> > Correction:
> > df <- data.frame(a = 1:3, b = letters[c(1,1,2)], d = LETTERS[c(1,1,2)])
> > df[!duplicated(df[,2:3]), ] ## Note the ! sign
> >
> > Bert Gunter
> >
> > "The trouble with having an open mind is that people keep coming along and sticking things into it."
> > -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
> >
> >
> > On Fri, Nov 8, 2019 at 7:59 AM Bert Gunter <bgunter.4567 using gmail.com> wrote:
> >>
> >> Sorry, but you ask basic questions.You really need to spend some more time with an R tutorial or two. This list is not meant to replace your own learning efforts.
> >>
> >> You also do not seem to be reading the docs carefully. Under ?unique, it links ?duplicated and tells you that it gives indices of duplicated rows of a data frame. These then can be used by subscripting to remove those rows from the data frame. Here is a reproducible example:
> >>
> >> df <- data.frame(a = 1:3, b = letters[c(1,1,2)], d = LETTERS[c(1,1,2)])
> >> df[-duplicated(df[,2:3]), ] ## Note the - sign
> >>
> >> If you prefer, the "Tidyverse" world has what are purported to be more user-friendly versions of such data handling functionality that you can use instead.
> >>
> >>
> >> Bert
> >>
> >> On Fri, Nov 8, 2019 at 7:38 AM Ana Marija <sokovic.anamarija using gmail.com> wrote:
> >>>
> >>> would you know how would I extract from my original data frame, just
> >>> these unique rows?
> >>> because this gives me only those 3 columns, and I want all columns
> >>> from the original data frame
> >>>
> >>> > head(udt)
> >>> chr pos gene_id
> >>> 1 chr1 54490 ENSG00000227232
> >>> 2 chr1 58814 ENSG00000227232
> >>> 3 chr1 60351 ENSG00000227232
> >>> 4 chr1 61920 ENSG00000227232
> >>> 5 chr1 63671 ENSG00000227232
> >>> 6 chr1 64931 ENSG00000227232
> >>>
> >>> > head(dt)
> >>> chr pos gene_id pval_nominal pval_ret wl wr META
> >>> 1: chr1 54490 ENSG00000227232 0.608495 0.783778 31.62278 21.2838 0.7475480
> >>> 2: chr1 58814 ENSG00000227232 0.295211 0.897582 31.62278 21.2838 0.6031214
> >>> 3: chr1 60351 ENSG00000227232 0.439788 0.867959 31.62278 21.2838 0.6907182
> >>> 4: chr1 61920 ENSG00000227232 0.319528 0.601809 31.62278 21.2838 0.4032200
> >>> 5: chr1 63671 ENSG00000227232 0.237739 0.988039 31.62278 21.2838 0.7482519
> >>> 6: chr1 64931 ENSG00000227232 0.276679 0.907037 31.62278 21.2838 0.5974800
> >>>
> >>> On Fri, Nov 8, 2019 at 9:30 AM Ana Marija <sokovic.anamarija using gmail.com> wrote:
> >>> >
> >>> > Thank you so much! Converting it to data frame resolved the issue!
> >>> >
> >>> > On Fri, Nov 8, 2019 at 9:19 AM Gerrit Eichner
> >>> > <gerrit.eichner using math.uni-giessen.de> wrote:
> >>> > >
> >>> > > It seems as if dt is not a (base R) data frame but a
> >>> > > data table. I assume, you will have to transform dt
> >>> > > into a data frame (maybe with as.data.frame) to be
> >>> > > able to apply unique in the suggested way. However,
> >>> > > I am not familiar with data tables. Perhaps somebody
> >>> > > else can provide a more profound guess.
> >>> > >
> >>> > > Regards -- Gerrit
> >>> > >
> >>> > > ---------------------------------------------------------------------
> >>> > > Dr. Gerrit Eichner Mathematical Institute, Room 212
> >>> > > gerrit.eichner using math.uni-giessen.de Justus-Liebig-University Giessen
> >>> > > Tel: +49-(0)641-99-32104 Arndtstr. 2, 35392 Giessen, Germany
> >>> > > http://www.uni-giessen.de/eichner
> >>> > > ---------------------------------------------------------------------
> >>> > >
> >>> > > Am 08.11.2019 um 16:02 schrieb Ana Marija:
> >>> > > > I tried it but I got this error:
> >>> > > >> udt <- unique(dt[c("chr", "pos", "gene_id")])
> >>> > > > Error in `[.data.table`(dt, c("chr", "pos", "gene_id")) :
> >>> > > > When i is a data.table (or character vector), the columns to join by
> >>> > > > must be specified using 'on=' argument (see ?data.table), by keying x
> >>> > > > (i.e. sorted, and, marked as sorted, see ?setkey), or by sharing
> >>> > > > column names between x and i (i.e., a natural join). Keyed joins might
> >>> > > > have further speed benefits on very large data due to x being sorted
> >>> > > > in RAM.
> >>> > > >
> >>> > > > On Fri, Nov 8, 2019 at 8:58 AM Gerrit Eichner
> >>> > > > <gerrit.eichner using math.uni-giessen.de> wrote:
> >>> > > >>
> >>> > > >> Hi, Ana,
> >>> > > >>
> >>> > > >> doesn't
> >>> > > >>
> >>> > > >> udt <- unique(dt[c("chr", "pos", "gene_id")])
> >>> > > >> nrow(udt)
> >>> > > >>
> >>> > > >> get close to what you want?
> >>> > > >>
> >>> > > >> Hth -- Gerrit
> >>> > > >>
> >>> > > >> ---------------------------------------------------------------------
> >>> > > >> Dr. Gerrit Eichner Mathematical Institute, Room 212
> >>> > > >> gerrit.eichner using math.uni-giessen.de Justus-Liebig-University Giessen
> >>> > > >> Tel: +49-(0)641-99-32104 Arndtstr. 2, 35392 Giessen, Germany
> >>> > > >> http://www.uni-giessen.de/eichner
> >>> > > >> ---------------------------------------------------------------------
> >>> > > >>
> >>> > > >> Am 08.11.2019 um 15:38 schrieb Ana Marija:
> >>> > > >>> Hello,
> >>> > > >>>
> >>> > > >>> I have a data frame like this:
> >>> > > >>>
> >>> > > >>>> head(dt,20)
> >>> > > >>> chr pos gene_id pval_nominal pval_ret wl wr
> >>> > > >>> 1: chr1 54490 ENSG00000227232 0.6084950 0.7837780 31.62278 21.2838
> >>> > > >>> 2: chr1 58814 ENSG00000227232 0.2952110 0.8975820 31.62278 21.2838
> >>> > > >>> 3: chr1 60351 ENSG00000227232 0.4397880 0.8679590 31.62278 21.2838
> >>> > > >>> 4: chr1 61920 ENSG00000227232 0.3195280 0.6018090 31.62278 21.2838
> >>> > > >>> 5: chr1 63671 ENSG00000227232 0.2377390 0.9880390 31.62278 21.2838
> >>> > > >>> 6: chr1 64931 ENSG00000227232 0.2766790 0.9070370 31.62278 21.2838
> >>> > > >>> 7: chr1 81587 ENSG00000227232 0.6057930 0.6167630 31.62278 21.2838
> >>> > > >>> 8: chr1 115746 ENSG00000227232 0.4078770 0.7799110 31.62278 21.2838
> >>> > > >>> 9: chr1 135203 ENSG00000227232 0.4078770 0.9299130 31.62278 21.2838
> >>> > > >>> 10: chr1 138593 ENSG00000227232 0.8464560 0.5696060 31.62278 21.2838
> >>> > > >>>
> >>> > > >>> it is very big,
> >>> > > >>>> dim(dt)
> >>> > > >>> [1] 73719122 8
> >>> > > >>>
> >>> > > >>> To count number of unique rows for all 3 columns: chr, pos and gene_id
> >>> > > >>> I could just join those 3 columns and than count. But how would I find
> >>> > > >>> unique number of rows for these 4 columns without joining them?
> >>> > > >>>
> >>> > > >>> Thanks
> >>> > > >>> Ana
> >>> > > >>>
> >>> > > >>> ______________________________________________
> >>> > > >>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> >>> > > >>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>> > > >>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> >>> > > >>> and provide commented, minimal, self-contained, reproducible code.
> >>> > > >>>
> >>> > > >>
> >>> > > >> ______________________________________________
> >>> > > >> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> >>> > > >> https://stat.ethz.ch/mailman/listinfo/r-help
> >>> > > >> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> >>> > > >> and provide commented, minimal, self-contained, reproducible code.
> >>>
> >>> ______________________________________________
> >>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> >>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> >>> and provide commented, minimal, self-contained, reproducible code.
>
> ______________________________________________
> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
> ______________________________________________
> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
More information about the R-help
mailing list