[R] how to find number of unique rows for combination of r columns
Boris Steipe
bor|@@@te|pe @end|ng |rom utoronto@c@
Fri Nov 8 16:49:48 CET 2019
Are you trying to eliminate duplicated rows from your dataframe? Because that would be better achieved with duplicated().
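
A minimal sketch of that idea, assuming the table has first been converted to a plain data frame with as.data.frame(), as discussed further down. The toy values and the names df, key_cols, is_dup, unique_rows and n_unique are made up for illustration; only the column names come from the thread.

```r
# Toy data frame standing in for the real table (values are invented).
df <- data.frame(
  chr          = c("chr1", "chr1", "chr1", "chr1"),
  pos          = c(54490L, 54490L, 58814L, 60351L),
  gene_id      = rep("ENSG00000227232", 4),
  pval_nominal = c(0.61, 0.30, 0.44, 0.32),
  stringsAsFactors = FALSE
)

key_cols <- c("chr", "pos", "gene_id")

# duplicated() flags each row whose key combination has already appeared,
# so negating it keeps the first occurrence of every combination.
is_dup <- duplicated(df[, key_cols])

unique_rows <- df[!is_dup, ]  # all original columns, one row per combination
n_unique    <- sum(!is_dup)   # number of distinct combinations
```

Because duplicated() keeps the first occurrence, the values in the remaining columns come from whichever row appears first for each combination.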
B.
> On 2019-11-08, at 10:32, Ana Marija <sokovic.anamarija using gmail.com> wrote:
>
> Would you know how I could extract just these unique rows from my
> original data frame? The result below has only those 3 columns,
> and I want all columns from the original data frame.
>
>> head(udt)
> chr pos gene_id
> 1 chr1 54490 ENSG00000227232
> 2 chr1 58814 ENSG00000227232
> 3 chr1 60351 ENSG00000227232
> 4 chr1 61920 ENSG00000227232
> 5 chr1 63671 ENSG00000227232
> 6 chr1 64931 ENSG00000227232
>
>> head(dt)
> chr pos gene_id pval_nominal pval_ret wl wr META
> 1: chr1 54490 ENSG00000227232 0.608495 0.783778 31.62278 21.2838 0.7475480
> 2: chr1 58814 ENSG00000227232 0.295211 0.897582 31.62278 21.2838 0.6031214
> 3: chr1 60351 ENSG00000227232 0.439788 0.867959 31.62278 21.2838 0.6907182
> 4: chr1 61920 ENSG00000227232 0.319528 0.601809 31.62278 21.2838 0.4032200
> 5: chr1 63671 ENSG00000227232 0.237739 0.988039 31.62278 21.2838 0.7482519
> 6: chr1 64931 ENSG00000227232 0.276679 0.907037 31.62278 21.2838 0.5974800
>
> On Fri, Nov 8, 2019 at 9:30 AM Ana Marija <sokovic.anamarija using gmail.com> wrote:
>>
>> Thank you so much! Converting it to data frame resolved the issue!
>>
>> On Fri, Nov 8, 2019 at 9:19 AM Gerrit Eichner
>> <gerrit.eichner using math.uni-giessen.de> wrote:
>>>
>>> It seems that dt is not a (base R) data frame but a
>>> data table. I assume you will have to convert dt
>>> into a data frame (e.g. with as.data.frame) to be
>>> able to apply unique in the suggested way. However,
>>> I am not familiar with data tables. Perhaps somebody
>>> else can provide a better-informed answer.
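
For a data.table, the call unique(dt[c("chr", "pos", "gene_id")]) fails because the first argument inside [ ] is treated as a row selector (i) rather than a column list, which is what the error message describes. A sketch of how this might be done without leaving data.table, assuming the package is installed; the toy values and the names key_cols, n_unique, udt_keys and udt_all are illustrative only:

```r
library(data.table)

# Toy data.table standing in for the real one (values are invented).
dt <- data.table(
  chr          = c("chr1", "chr1", "chr1", "chr1"),
  pos          = c(54490L, 54490L, 58814L, 60351L),
  gene_id      = rep("ENSG00000227232", 4),
  pval_nominal = c(0.61, 0.30, 0.44, 0.32)
)

key_cols <- c("chr", "pos", "gene_id")

# Count distinct combinations of the key columns directly.
n_unique <- uniqueN(dt, by = key_cols)

# Column selection goes in the second argument (j), not the first.
udt_keys <- unique(dt[, key_cols, with = FALSE])

# Keep all columns, one row (the first) per key combination.
udt_all <- unique(dt, by = key_cols)
```

On a table with tens of millions of rows, uniqueN() avoids building the intermediate de-duplicated table when only the count is needed.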
>>>
>>> Regards -- Gerrit
>>>
>>> ---------------------------------------------------------------------
>>> Dr. Gerrit Eichner Mathematical Institute, Room 212
>>> gerrit.eichner using math.uni-giessen.de Justus-Liebig-University Giessen
>>> Tel: +49-(0)641-99-32104 Arndtstr. 2, 35392 Giessen, Germany
>>> http://www.uni-giessen.de/eichner
>>> ---------------------------------------------------------------------
>>>
>>> Am 08.11.2019 um 16:02 schrieb Ana Marija:
>>>> I tried it but I got this error:
>>>>> udt <- unique(dt[c("chr", "pos", "gene_id")])
>>>> Error in `[.data.table`(dt, c("chr", "pos", "gene_id")) :
>>>> When i is a data.table (or character vector), the columns to join by
>>>> must be specified using 'on=' argument (see ?data.table), by keying x
>>>> (i.e. sorted, and, marked as sorted, see ?setkey), or by sharing
>>>> column names between x and i (i.e., a natural join). Keyed joins might
>>>> have further speed benefits on very large data due to x being sorted
>>>> in RAM.
>>>>
>>>> On Fri, Nov 8, 2019 at 8:58 AM Gerrit Eichner
>>>> <gerrit.eichner using math.uni-giessen.de> wrote:
>>>>>
>>>>> Hi, Ana,
>>>>>
>>>>> doesn't
>>>>>
>>>>> udt <- unique(dt[c("chr", "pos", "gene_id")])
>>>>> nrow(udt)
>>>>>
>>>>> get close to what you want?
>>>>>
>>>>> Hth -- Gerrit
>>>>>
>>>>>
>>>>> Am 08.11.2019 um 15:38 schrieb Ana Marija:
>>>>>> Hello,
>>>>>>
>>>>>> I have a data frame like this:
>>>>>>
>>>>>>> head(dt,20)
>>>>>> chr pos gene_id pval_nominal pval_ret wl wr
>>>>>> 1: chr1 54490 ENSG00000227232 0.6084950 0.7837780 31.62278 21.2838
>>>>>> 2: chr1 58814 ENSG00000227232 0.2952110 0.8975820 31.62278 21.2838
>>>>>> 3: chr1 60351 ENSG00000227232 0.4397880 0.8679590 31.62278 21.2838
>>>>>> 4: chr1 61920 ENSG00000227232 0.3195280 0.6018090 31.62278 21.2838
>>>>>> 5: chr1 63671 ENSG00000227232 0.2377390 0.9880390 31.62278 21.2838
>>>>>> 6: chr1 64931 ENSG00000227232 0.2766790 0.9070370 31.62278 21.2838
>>>>>> 7: chr1 81587 ENSG00000227232 0.6057930 0.6167630 31.62278 21.2838
>>>>>> 8: chr1 115746 ENSG00000227232 0.4078770 0.7799110 31.62278 21.2838
>>>>>> 9: chr1 135203 ENSG00000227232 0.4078770 0.9299130 31.62278 21.2838
>>>>>> 10: chr1 138593 ENSG00000227232 0.8464560 0.5696060 31.62278 21.2838
>>>>>>
>>>>>> it is very big,
>>>>>>> dim(dt)
>>>>>> [1] 73719122 8
>>>>>>
>>>>>> To count the number of unique rows for the 3 columns chr, pos and gene_id
>>>>>> I could just join those 3 columns and then count. But how would I find
>>>>>> the number of unique rows for these 3 columns without joining them?
>>>>>>
>>>>>> Thanks
>>>>>> Ana
>>>>>>
>>>>>> ______________________________________________
>>>>>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>>
>>>>>