[R] Counting occurrences of variables in a dataframe

Sun Feb 12 08:11:06 CET 2012

On Sat, Feb 11, 2012 at 04:05:25PM -0500, David Winsemius wrote:
> 
> On Feb 11, 2012, at 1:17 PM, Kai Mx wrote:
> 
> >Hi everybody,
> >I have a large dataframe similar to this one:
> >knames <-c('ab', 'aa', 'ac', 'ad', 'ab', 'ac', 'aa', 'ad','ae', 'af')
> >kdate <- as.Date( c('20111001', '20111102', '20101001', '20100315',
> >'20101201', '20110105', '20101001', '20110504', '20110603',  
> >'20110201'),
> >format="%Y%m%d")
> >kdata <- data.frame (knames, kdate)
> 
> >  ave(unclass(kdate), knames, FUN=order )
>  [1] 2 2 1 1 1 2 1 2 1 1
> 
> 
> That was actually not using the dataframe values but you could also do  
> this:
> 
> > kdata$ord <- with(kdata, ave(unclass(kdate), knames, FUN=order ))
> > kdata
>    knames      kdate ord
> 1      ab 2011-10-01   2
> 2      aa 2011-11-02   2
> 3      ac 2010-10-01   1
> 4      ad 2010-03-15   1
> 5      ab 2010-12-01   1
> 6      ac 2011-01-05   2
> 7      aa 2010-10-01   1
> 8      ad 2011-05-04   2
> 9      ae 2011-06-03   1
> 10     af 2011-02-01   1

Hi.

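This is a good solution if each name occurs at most twice. If a name
occurs more often, then the function "order" should be replaced by
"rank": order() returns the positions of the elements in sorted order,
while rank() returns the sorted position of each element. Replacing
the name "aa" at row 2 by "ab", we get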

  knames <-c('ab', 'ab', 'ac', 'ad', 'ab', 'ac', 'aa', 'ad','ae', 'af')
  kdate <- as.Date( c('20111001', '20111102', '20101001', '20100315',
  '20101201', '20110105', '20101001', '20110504', '20110603', '20110201'),
  format="%Y%m%d")
  kdata <- data.frame (knames, kdate)

  kdata$ord <- with(kdata, ave(unclass(kdate), knames, FUN=order))
  kdata$rank <- with(kdata, ave(unclass(kdate), knames, FUN=rank))
  kdata

     knames      kdate ord rank
  1      ab 2011-10-01   3    2
  2      ab 2011-11-02   1    3
  3      ac 2010-10-01   1    1
  4      ad 2010-03-15   1    1
  5      ab 2010-12-01   2    1
  6      ac 2011-01-05   2    2
  7      aa 2010-10-01   1    1
  8      ad 2011-05-04   2    2
  9      ae 2011-06-03   1    1
  10     af 2011-02-01   1    1

Sorted by date, the name "ab" occurs in the order row 5, row 1, row 2,
so row 1 should get index 2 and row 2 index 3, as in column "rank".
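
As a small sketch of why the two columns differ, consider only the three
dates of the "ab" rows (rows 1, 2 and 5), copied from the example above.
Without ties, order() and rank() are inverse permutations of each other.
A permutation of length at most two is its own inverse, which is why
"order" gives the right answer when a name occurs at most twice with
distinct dates.

```r
# Dates of the three "ab" rows (rows 1, 2, 5), as plain numbers
x <- unclass(as.Date(c("2011-10-01", "2011-11-02", "2010-12-01")))

o <- order(x)  # which element comes 1st, 2nd, 3rd when sorted: 3 1 2
r <- rank(x)   # sorted position of each element:               2 3 1

# Without ties, order() and rank() are inverse permutations
stopifnot(all(r[o] == seq_along(x)), all(o[r] == seq_along(x)))
```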

If some of the dates repeat within a name, then rank() by default
assigns the average index to the tied rows. In this case, the following
function f(), which breaks ties by order of appearance, may be used:

  knames <-c('ab', 'ab', 'ac', 'ad', 'ab', 'ac', 'aa', 'ad','ae', 'af')
  kdate <- as.Date( c('20111001', '20111001', '20101001', '20100315',
  '20101201', '20110105', '20101001', '20110504', '20110603', '20110201'),
  format="%Y%m%d")
  kdata <- data.frame (knames, kdate)

  kdata$rank <- with(kdata, ave(unclass(kdate), knames, FUN=rank))
  f <- function(x) rank(x, ties.method="first")
  kdata$f <- with(kdata, ave(unclass(kdate), knames, FUN=f))
  kdata

     knames      kdate rank f
  1      ab 2011-10-01  2.5 2
  2      ab 2011-10-01  2.5 3
  3      ac 2010-10-01  1.0 1
  4      ad 2010-03-15  1.0 1
  5      ab 2010-12-01  1.0 1
  6      ac 2011-01-05  2.0 2
  7      aa 2010-10-01  1.0 1
  8      ad 2011-05-04  2.0 2
  9      ae 2011-06-03  1.0 1
  10     af 2011-02-01  1.0 1
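
As a cross-check of f(), the same indices can also be obtained by
applying order() twice, since order() keeps tied elements in their
original sequence. The function g() below is only an illustrative
alternative and is not part of the solution above; the dates are the
tied "ab" dates from the last example.

```r
# Dates of the "ab" rows (rows 1, 2, 5) with a tie between rows 1 and 2
x <- unclass(as.Date(c("2011-10-01", "2011-10-01", "2010-12-01")))

f <- function(x) rank(x, ties.method = "first")
g <- function(x) order(order(x))  # illustrative alternative, an assumption

f(x)  # 2 3 1
g(x)  # 2 3 1
stopifnot(all(f(x) == g(x)))
```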

Hope this helps.

Petr Savicky.