[R] drop rare factors
Sam Steingold
sds at gnu.org
Thu Jan 19 21:43:08 CET 2012
create data:
mydata <- data.frame(MyFactor = factor(rep(LETTERS[1:4], times=c(1000, 2000, 30, 4))), something = runif(3034))
define function:
drop.levels <- function (df, column, threshold) {
  size <- nrow(df)
  if (threshold < 1) threshold <- threshold * size
  tab <- table(df[column])
  keep <- names(tab)[tab > threshold]
  drop <- names(tab)[tab <= threshold]
  cat("Keep(",column,")",length(keep),"\n"); print(tab[keep])
  cat("Drop(",column,")",length(drop),"\n"); print(tab[drop])
  str(df)
  df <- df[df[column] %in% keep, ]
  str(df)
  size1 <- nrow(df)
  cat("Rows:",size,"-->",size1,"(dropped",100*(size-size1)/size,"%)\n")
  df[column] <- factor(df[column], levels=keep)
  df
}
call the function on the data:
drop.levels(mydata,"MyFactor",5)
Keep( MyFactor ) 3

A B C
1000 2000 30
Drop( MyFactor ) 1
D
4
'data.frame': 3034 obs. of 2 variables:
$ MyFactor : Factor w/ 4 levels "A","B","C","D": 1 1 1 1 1 1 1 1 1 1 ...
$ something: num 0.725 0.741 0.608 0.681 0.993 ...
'data.frame': 0 obs. of 2 variables:
$ MyFactor : Factor w/ 4 levels "A","B","C","D":
$ something: num
Rows: 3034 --> 0 (dropped 100 %)
Error in `[<-.data.frame`(`*tmp*`, column, value = NA_integer_) :
replacement has 1 rows, data has 0
----- why is there a blank line between "Keep( MyFactor ) 3" and "A B C"
but no blank line between "Drop" and "D"?
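One possible explanation, offered as a sketch rather than a confirmed answer: indexing a one-dimensional table with several names keeps it a one-dimensional array, while indexing with a single name drops the dimension and leaves a plain named vector, and the two are printed differently. The small table `tab` below is made up for illustration and is not the data from this post.

```r
# Hypothetical miniature of the data above
f <- factor(rep(c("A", "B", "C", "D"), times = c(10, 20, 3, 1)))
tab <- table(f)

many <- tab[c("A", "B", "C")]  # several entries selected
one  <- tab["D"]               # a single entry selected

dim(many)  # still has a dim attribute: printed as a 1-d array, with a
           # header line for the (possibly empty) dimension name
dim(one)   # NULL: the dimension was dropped, so this prints as an
           # ordinary named vector without that header line

# Asking R not to drop the dimension keeps the single entry as a 1-d array
one_kept <- tab["D", drop = FALSE]
dim(one_kept)
```

If that is indeed the cause, printing `tab[drop, drop = FALSE]` inside the function should make both printouts look alike.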
----- why does "df[df[column] %in% keep, ]" empty out the data frame?
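A likely cause, sketched under the assumption that `column` holds a single column name: `df[column]` returns a one-column data frame (a list), so `%in%` treats that list as one element and returns a single value, which does not match any of the level names and therefore selects no rows. `df[[column]]` extracts the column itself. The tiny data frame `d` is made up for illustration.

```r
# Hypothetical example data, not the data from this post
d <- data.frame(MyFactor = factor(c("A", "A", "B", "D")), x = 1:4)
keep <- c("A", "B")
column <- "MyFactor"

bad  <- d[column] %in% keep     # one-column data frame on the left-hand side
good <- d[[column]] %in% keep   # the factor itself on the left-hand side

length(bad)    # a single value, not one per row
length(good)   # one logical per row

# Subset with the per-row logical, then rebuild the factor so that the
# level that no longer occurs is dropped
kept <- d[good, ]
kept[[column]] <- factor(kept[[column]], levels = keep)
nrow(kept)
levels(kept[[column]])
```

The same `[[` form would presumably also be needed in the later `factor(df[column], levels=keep)` line, since `factor()` expects a vector rather than a data frame.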
thanks!
> Remind the list what you're trying to do. The list gets lots of traffic;
> if you delete out all the context nobody will remember what you need.
Sorry, I assumed that people can easily access the parent messages.
--
Sam Steingold (http://sds.podval.org/) on Ubuntu 11.10 (oneiric) X 11.0.11004000
http://www.PetitionOnline.com/tap12009/ http://pmw.org.il
http://mideasttruth.com http://memri.org http://openvotingconsortium.org
"Syntactic sugar causes cancer of the semicolon." -Alan Perlis