Michael Friendly friendly at yorku.ca
Thu Mar 5 19:45:11 CET 2015

A consulting client has a large data set with a binary response 
(negative) and two factors (ctry and member) that have many levels, 
many of which occur with very small frequencies.  The data are far too 
sparse to fit a model like glm(negative ~ ctry+member, family=binomial).

 > str(Dataset)
'data.frame':   10672 obs. of  5 variables:
  $ ctry    : Factor w/ 31 levels "Barbados","Belize",..: 21 21 5 22 18 18 18 18 26 18 ...
  $ member  : Factor w/ 163 levels "","ADHOPIA, PREETI ",..: 150 19 19 111 120 1 1 4 55 18 ...
  $ negative: int  0 1 0 1 1 1 1 0 0 0 ...

For analysis, we'd like to subset the data to include only those factor 
levels that occur with frequency greater than a given value, or the top 
10 (say) levels in frequency, or the highest-frequency levels that 
together account for 80% (say) of the total.  I'm not sure how to do any 
of these in R.  Can anyone help?
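
For reference, here is a minimal base-R sketch of how the three selections described above might be made for one factor. It is only an illustration under stated assumptions: the toy `Dataset` stands in for the real data, and the names `min_n`, `top_k`, `share`, `keep_min`, `keep_top`, `keep_cum` and `sub` are hypothetical, not part of the original data or of any package.

```r
# Toy stand-in for the real data (assumption); column names follow str() above.
set.seed(1)
Dataset <- data.frame(
  ctry = factor(sample(c("A", "B", "C", "D", "E"), 200, replace = TRUE,
                       prob = c(0.50, 0.25, 0.15, 0.07, 0.03))),
  negative = rbinom(200, 1, 0.4)
)

# Frequency of each level, largest first (NA values are not counted)
freq <- sort(table(Dataset$ctry), decreasing = TRUE)

# (1) Levels that occur more than min_n times (hypothetical threshold)
min_n <- 20
keep_min <- names(freq)[as.vector(freq) > min_n]

# (2) The top_k most frequent levels (ties are cut off by sort order)
top_k <- 3
keep_top <- names(freq)[seq_len(min(top_k, length(freq)))]

# (3) The fewest high-frequency levels that together cover
#     at least `share` of all observations
share <- 0.80
cum <- cumsum(as.vector(freq)) / sum(freq)
keep_cum <- names(freq)[seq_len(which(cum >= share)[1])]

# Subset on one of the selections, then drop the now-empty levels
sub <- droplevels(Dataset[Dataset$ctry %in% keep_cum, ])
```

The same steps could presumably be repeated for member, with the two `%in%` conditions combined by `&` before refitting something like glm(negative ~ ctry+member, family=binomial, data=sub). Whether the remaining cells are then dense enough for the model is a separate question that this sketch does not address.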

