[R] subset a data frame by largest frequencies of factors

Fri Mar 6 11:15:32 CET 2015

> -----Original Message-----
> A consulting client has a large data set with a binary response
> (negative) and two factors (ctry and member) which have many levels, but
> many occur with very small frequencies.  It is far too sparse with a model like
> glm(negative ~ ctry+member, family=binomial).
> 
> For analysis, we'd like to subset the data to include only those that occur with
> frequency greater than a given value

ave() helps with this kind of thing. 

Something like

freq <- ave(1:length(ctry), factor(ctry:member), FUN=length)

gives, for every row, the count for its ctry:member combination (this assumes ctry and member are factors). Then you can subset a data frame using, for example

dfr.subset <- dfr[freq>10, ]

The 1:length(ctry) in the ave call is there simply because ave wants a numeric vector as its first argument. Since all we're doing with it is counting, its values don't matter; it just has to be a numeric vector of the same length as your data. In a data frame it can be 1:nrow(dfr), etc.
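
To make that concrete, here is a minimal sketch on a made-up data frame. The column names follow the thread, but the values and the threshold of 2 are purely illustrative assumptions, and the droplevels() step is an addition of mine rather than part of the suggestion above.

```r
# Toy data: column names follow the thread, the values are invented.
dfr <- data.frame(
  ctry     = factor(c("DE", "DE", "DE", "FR", "FR", "UK")),
  member   = factor(c("a",  "a",  "a",  "b",  "b",  "a")),
  negative = c(0, 1, 0, 1, 1, 0)
)

# Number of rows in each ctry:member combination, repeated on every row.
# ':' on two factors forms their interaction; factor() drops empty cells.
freq <- with(dfr, ave(1:nrow(dfr), factor(ctry:member), FUN = length))
freq
# [1] 3 3 3 2 2 1

# The values of the first argument are irrelevant; only its length matters.
freq2 <- with(dfr, ave(rep(1, nrow(dfr)), factor(ctry:member), FUN = length))
all(freq == freq2)
# [1] TRUE

# Keep combinations with at least 2 rows (illustrative threshold).
dfr.subset <- dfr[freq >= 2, ]

# Drop factor levels that no longer occur, e.g. before calling glm().
dfr.subset <- droplevels(dfr.subset)
levels(dfr.subset$ctry)
# [1] "DE" "FR"
```

Without droplevels() the subset would still carry the unused level "UK" in ctry, which is usually not what you want when the factors go into a model formula.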

S Ellison
