[R] How to count number of year per firm in panel data?

Wed Feb 11 18:48:52 CET 2009

ave is also handy here:

DF$number <- ave(DF$year, DF$id, FUN = length)


On Wed, Feb 11, 2009 at 12:38 PM, Marc Schwartz
<marc_schwartz at comcast.net> wrote:
> on 02/11/2009 10:43 AM Johannes Habel wrote:
>> Hello,
>>
>> I have an unbalanced panel dataset and would like to exclude all objects
>> that don't appear at least x times.
>>
>> Therefore, I would like to include a column indicating for every line how
>> many periods are available, e.g.
>>
>> id, year, number
>> 1, 2000, 3
>> 1, 2001, 3
>> 1, 2002, 3
>> 2, 2001, 1
>> 3, ..., ...
>>
>> This would allow me to exclude companies by setting "subset=number>=x".
>> However, I don't know how to create this column, i.e. how to count the years
>> for each object and include the numbers into the dataset.
>>
>> Could anybody help me, please?
>>
>> Alternatively, is there an easier way to achieve my goal?
>>
>> Thank you very much.
>>
>> Johannes Habel
>
> You don't need to add the extra column. You can just create a frequency
> table of the unique 'id' values, get the subset of values that meet your
> count criteria and then use those values in subset().
>
> Let's create a little larger dataset:
>
> set.seed(1)
> id <- sample(letters[1:4], 8, replace = TRUE)
>
>> id
> [1] "b" "b" "c" "d" "a" "d" "d" "c"
>
>
> years <- unlist(lapply(split(id, id),
>                       function(i) 2000:(2000 + length(i) - 1)))
>
>> years
>   a   b1   b2   c1   c2   d1   d2   d3
> 2000 2000 2001 2000 2001 2000 2001 2002
>
>
> DF <- data.frame(id = sort(id), year = years)
>
>> DF
>   id year
> a   a 2000
> b1  b 2000
> b2  b 2001
> c1  c 2000
> c2  c 2001
> d1  d 2000
> d2  d 2001
> d3  d 2002
>
>> subset(DF, id %in% names(which(table(DF$id) >= 3)))
>   id year
> d1  d 2000
> d2  d 2001
> d3  d 2002
>
>
> Step by step:
>
>> table(DF$id)
>
> a b c d
> 1 2 2 3
>
>
>> table(DF$id) >= 3
>
>    a     b     c     d
> FALSE FALSE FALSE  TRUE
>
>
>> which(table(DF$id) >= 3)
> d
> 4
>
>
>> names(which(table(DF$id) >= 3))
> [1] "d"
>
>
> Then use subset() as above, filtering only those id's that are in the
> names from the table. If we change the requirement to >= 2:
>
>> subset(DF, id %in% names(which(table(DF$id) >= 2)))
>   id year
> b1  b 2000
> b2  b 2001
> c1  c 2000
> c2  c 2001
> d1  d 2000
> d2  d 2001
> d3  d 2002
>
>
> See ?table, ?names, ?which and ?"%in%"
>
>
> If you really need to add the column, you could use aggregate() to get a
> count of years for each id as a data frame, then use merge() to add the
> column to DF:
>
>> aggregate(DF$year, list(id = DF$id), length)
>  id x
> 1  a 1
> 2  b 2
> 3  c 2
> 4  d 3
>
>
>> merge(DF, aggregate(DF$year, list(id = DF$id), length), by = "id")
>  id year x
> 1  a 2000 1
> 2  b 2000 2
> 3  b 2001 2
> 4  c 2000 2
> 5  c 2001 2
> 6  d 2000 3
> 7  d 2001 3
> 8  d 2002 3
>
> Then use subset() as you initially considered.
>
> See ?aggregate and ?merge
>
> HTH,
>
> Marc Schwartz
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>