[R] How to delete duplicate cases?

Thu Jul 24 16:34:03 CEST 2008

on 07/24/2008 09:00 AM Daniel Wagner wrote:
> Dear R users,
>  
> I have a data frame with a lot of duplicate cases. For each set of duplicates, I want to delete the lower-ranked cases and keep only the case with the highest rank.
> e.g.
>  
>> df1
>   cno      rank
> 1  1342    0.23
> 2  1342    0.14
> 3  1342    0.56
> 4  2568    0.15
> 5  2568    0.89
>  
> So I want to keep the 3rd and 5th cases, which have the highest rank (0.56 and 0.89), and delete the rest of the duplicate cases.
> Could somebody help me?
>  
> Regards
>  
> Daniel
> Amsterdam

For the simple two-column case, see ?aggregate:

 > aggregate(df1$rank, list(cno = df1$cno), max)
    cno    x
1 1342 0.56
2 2568 0.89
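
As a self-contained sketch of the same idea: the data frame below is rebuilt from the values printed in the question, and the formula interface to aggregate() (available in current versions of R) is used so that the result keeps the column name rank instead of x. The object name res is just for illustration.

```r
# Rebuild the example data from the question (values copied from the post)
df1 <- data.frame(cno  = c(1342, 1342, 1342, 2568, 2568),
                  rank = c(0.23, 0.14, 0.56, 0.15, 0.89))

# Formula interface: maximum rank within each value of cno
res <- aggregate(rank ~ cno, data = df1, FUN = max)
print(res)
```

Note that aggregate() returns only the grouping column and the summarised column, so any other columns in the data frame are dropped.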

A more generic approach might be:

 > do.call(rbind, lapply(split(df1, df1$cno),
                         function(x) x[which.max(x$rank), ]))
       cno rank
1342 1342 0.56
2568 2568 0.89
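
Another possible route, closer to the "delete duplicates" wording of the question, is to sort first and then drop repeats with duplicated(). This is only a sketch under the assumption that the data look like the example in the question; the names sorted and best are made up for illustration.

```r
# Rebuild the example data from the question (values copied from the post)
df1 <- data.frame(cno  = c(1342, 1342, 1342, 2568, 2568),
                  rank = c(0.23, 0.14, 0.56, 0.15, 0.89))

# Sort by cno, then by decreasing rank, so the top-ranked row of each cno comes first
sorted <- df1[order(df1$cno, -df1$rank), ]

# duplicated() flags every occurrence of a cno after the first; drop those rows
best <- sorted[!duplicated(sorted$cno), ]
print(best)
```

This keeps all columns of the retained rows, along with their original row names.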

For example, using the iris dataset, get the rows, by Species, with the 
highest Sepal.Length:

 > do.call(rbind, lapply(split(iris, iris$Species),
                         function(x) x[which.max(x$Sepal.Length), ]))
            Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
setosa              5.8         4.0          1.2         0.2     setosa
versicolor          7.0         3.2          4.7         1.4 versicolor
virginica           7.9         3.8          6.4         2.0  virginica
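
One caveat: which.max() returns only the first maximum, so if two rows in a group tie for the highest value, only one of them is kept. If all tied rows should be retained, a comparison against max() could be used instead. The data frame df2 below is hypothetical, invented purely to show a tie.

```r
# Hypothetical data with a tie for the highest rank within cno 1
df2 <- data.frame(cno  = c(1, 1, 2),
                  rank = c(0.5, 0.5, 0.9))

# Keep every row whose rank equals the maximum rank within its group
ties <- do.call(rbind, lapply(split(df2, df2$cno),
                              function(x) x[x$rank == max(x$rank), ]))
print(ties)
```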

HTH,

Marc Schwartz