[R] subset a data frame by largest frequencies of factors
Michael Friendly
friendly at yorku.ca
Thu Mar 5 19:45:11 CET 2015
A consulting client has a large data set with a binary response
(negative) and two factors (ctry and member). Both factors have many
levels, but many of those levels occur with very small frequencies, so
the data are far too sparse to fit a model like
glm(negative ~ ctry+member, family=binomial).
> str(Dataset)
'data.frame': 10672 obs. of 5 variables:
 $ ctry    : Factor w/ 31 levels "Barbados","Belize",..: 21 21 5 22 18 18 18 18 26 18 ...
 $ member  : Factor w/ 163 levels "","ADHOPIA, PREETI ",..: 150 19 19 111 120 1 1 4 55 18 ...
$ negative: int 0 1 0 1 1 1 1 0 0 0 ...
>
For analysis, we'd like to subset the data to include only the
observations whose factor levels meet one of these criteria:
the level occurs with frequency greater than a given value, or
it is among the top 10 (say) levels in frequency, or it is among
the highest-frequency levels accounting for 80% (say) of the total.
I'm not sure how to do any of these in R. Can anyone help?
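
For concreteness, here is a minimal sketch of the three selections in base R. It is one possible approach, not a definitive answer. The toy data frame only borrows the column names ctry and negative from the str() output above; its values, and the names min_n, k, p, keep_min, keep_top, keep_cum and sub, are made up for illustration.

```r
# Toy stand-in for the real data (hypothetical values; only the column
# names ctry and negative come from the original post)
Dataset <- data.frame(
  ctry     = factor(rep(c("A", "B", "C", "D", "E"),
                        times = c(100, 50, 30, 14, 6))),
  negative = rep(c(0L, 1L), length.out = 200)
)

# Frequency of each level, largest first
freq <- sort(table(Dataset$ctry), decreasing = TRUE)

# (1) levels that occur more than min_n times
min_n <- 20
keep_min <- names(freq)[as.vector(freq) > min_n]

# (2) the k most frequent levels
k <- 3
keep_top <- names(freq)[seq_len(min(k, length(freq)))]

# (3) the most frequent levels that together cover at least a
#     proportion p of all observations
p <- 0.80
cum <- cumsum(as.vector(freq)) / sum(freq)
keep_cum <- names(freq)[seq_len(which(cum >= p)[1])]

# Subset on one of the criteria and drop the now-unused levels
sub <- droplevels(Dataset[Dataset$ctry %in% keep_top, ])
```

The same steps could be repeated for member, and the two conditions combined with & when subsetting. droplevels() matters here because otherwise the empty levels stay in the factor and would still appear as (inestimable) terms in a glm() fit.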
--
Michael Friendly Email: friendly AT yorku DOT ca
Professor, Psychology Dept. & Chair, Quantitative Methods
York University Voice: 416 736-2100 x66249 Fax: 416 736-5814
4700 Keele Street Web:http://www.datavis.ca
Toronto, ONT M3J 1P3 CANADA