[R] How to count number of year per firm in panel data?
Gabor Grothendieck
ggrothendieck at gmail.com
Wed Feb 11 18:48:52 CET 2009
ave is also handy here:
DF$number <- ave(DF$year, DF$id, FUN = length)
On Wed, Feb 11, 2009 at 12:38 PM, Marc Schwartz
<marc_schwartz at comcast.net> wrote:
> on 02/11/2009 10:43 AM Johannes Habel wrote:
>> Hello,
>>
>> I have an unbalanced panel dataset and would like to exclude all objects
>> that don't appear at least x times.
>>
>> Therefore, I would like to include a column indicating for every line how
>> many periods are available, e.g.
>>
>> id, year, number
>> 1, 2000, 3
>> 1, 2001, 3
>> 1, 2002, 3
>> 2, 2001, 1
>> 3, ..., ...
>>
>> This would allow me to exclude companies by setting "subset=number>=x".
>> However, I don't know how to create this column, i.e. how to count the years
>> for each object and include the numbers into the dataset.
>>
>> Could anybody help me, please?
>>
>> Alternatively, is there an easier way to achieve my goal?
>>
>> Thank you very much.
>>
>> Johannes Habel
>
> You don't need to add the extra column. You can just create a frequency
> table of the unique 'id' values, get the subset of values that meet your
> count criteria and then use those values in subset().
>
> Let's create a little larger dataset:
>
> set.seed(1)
> id <- sample(letters[1:4], 8, replace = TRUE)
>
>> id
> [1] "b" "b" "c" "d" "a" "d" "d" "c"
>
>
> years <- unlist(lapply(split(id, id),
> function(i) 2000:(2000 + length(i) - 1)))
>
>> years
> a b1 b2 c1 c2 d1 d2 d3
> 2000 2000 2001 2000 2001 2000 2001 2002
>
>
> DF <- data.frame(id = sort(id), year = years)
>
>> DF
> id year
> a a 2000
> b1 b 2000
> b2 b 2001
> c1 c 2000
> c2 c 2001
> d1 d 2000
> d2 d 2001
> d3 d 2002
>
>> subset(DF, id %in% names(which(table(DF$id) >= 3)))
> id year
> d1 d 2000
> d2 d 2001
> d3 d 2002
>
>
> Step by step:
>
>> table(DF$id)
>
> a b c d
> 1 2 2 3
>
>
>> table(DF$id) >= 3
>
> a b c d
> FALSE FALSE FALSE TRUE
>
>
>> which(table(DF$id) >= 3)
> d
> 4
>
>
>> names(which(table(DF$id) >= 3))
> [1] "d"
>
>
> Then use subset() as above, filtering only those id's that are in the
> names from the table. If we change the requirement to >= 2:
>
>> subset(DF, id %in% names(which(table(DF$id) >= 2)))
> id year
> b1 b 2000
> b2 b 2001
> c1 c 2000
> c2 c 2001
> d1 d 2000
> d2 d 2001
> d3 d 2002
>
>
> See ?table, ?names, ?which and ?"%in%"
>
>
> If you really need to add the column, you could use aggregate() to get a
> count of years for each id as a data frame, then use merge() to add the
> column to DF:
>
>> aggregate(DF$year, list(id = DF$id), length)
> id x
> 1 a 1
> 2 b 2
> 3 c 2
> 4 d 3
>
>
>> merge(DF, aggregate(DF$year, list(id = DF$id), length), by = "id")
> id year x
> 1 a 2000 1
> 2 b 2000 2
> 3 b 2001 2
> 4 c 2000 2
> 5 c 2001 2
> 6 d 2000 3
> 7 d 2001 3
> 8 d 2002 3
>
> Then use subset() as you initially considered.
>
> See ?aggregate and ?merge
>
> HTH,
>
> Marc Schwartz
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
More information about the R-help
mailing list