[R] how to collapse categories or re-categorize variables?
Henric Winell
nilsson.henric at gmail.com
Mon Jul 19 19:06:51 CEST 2010
On 2010-07-17 23:03, Peter Dalgaard wrote:
> Ista Zahn wrote:
>> Hi,
>> On Fri, Jul 16, 2010 at 5:18 PM, CC <turtysmail at gmail.com> wrote:
>>> I am sure this is a very basic question:
>>>
>>> I have 600,000 categorical variables in a data.frame - each of which is
>>> classified as "0", "1", or "2"
>>>
>>> What I would like to do is collapse "1" and "2" and leave "0" by itself,
>>> such that after re-categorizing "0" = "0"; "1" = "1" and "2" = "1" --- in
>>> the end I only want "0" and "1" as categories for each of the variables.
>> Something like this should work
>>
>> for (i in names(dat)) {
>> dat[, i] <- factor(dat[, i], levels = c("0", "1", "2"), labels =
>> c("0", "1", "1"))
>> }
>
> Unfortunately, it won't:
>
>> d <- 0:2
>> factor(d, levels=c(0,1,1))
> [1] 0 1 <NA>
> Levels: 0 1 1
> Warning message:
> In `levels<-`(`*tmp*`, value = c("0", "1", "1")) :
> duplicated levels will not be allowed in factors anymore
>
>
> This effect, I have been told, goes way back to design choices in S
> (that you can have repeated level names) plus compatibility ever since.
>
> It would make more sense if it behaved like
>
> d <- factor(d); levels(d) <- c(0,1,1)
>
> and maybe, some time in the future, it will. Meanwhile, the above is the
> workaround.
>
> (BTW, if there are 600000 variables, you probably don't want to iterate
> over their names, more likely "for(i in seq_along(dat))...")
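
For reference, Peter's workaround applied column by column might look
like the sketch below. The data frame 'dat' and its column names are
made up for illustration, and I am assuming every column is coded
0/1/2 as in the original question.

```r
# Hypothetical example data: three columns coded 0/1/2 (names are made up)
dat <- data.frame(a = c(0, 1, 2, 2), b = c(2, 0, 1, 0), c = c(1, 1, 0, 2))

for (i in seq_along(dat)) {
  # Fix the level order explicitly so the replacement labels line up
  f <- factor(dat[[i]], levels = c("0", "1", "2"))
  # Assigning a repeated label merges the old levels "1" and "2"
  levels(f) <- c("0", "1", "1")
  dat[[i]] <- f
}
```

Indexing by position with seq_along() avoids looking up each of the
600000 columns by name.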

You could also use 'lapply' with 'levels<-':

> ### Example data
> set.seed(1)
> d <- 0:2
> DF <- data.frame(X1 = factor(sample(d, size = 10, replace = TRUE)),
+ X2 = factor(sample(d, size = 10, replace = TRUE)))
> DF
   X1 X2
1   0  0
2   1  0
3   1  2
4   2  1
5   0  2
6   2  1
7   2  2
8   1  2
9   1  1
10  0  2
>
> ### Merge levels "1" and "2" in every column
> DF[] <- lapply(DF, function(x) "levels<-"(x, c("0", "1", "1")))
> DF
   X1 X2
1   0  0
2   1  0
3   1  1
4   1  1
5   0  1
6   1  1
7   1  1
8   1  1
9   1  1
10  0  1
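
A variant that does not depend on the order of the existing levels is
to give 'levels<-' a named list, where each name is a new level and
each element lists the old levels to merge into it. A minimal sketch,
with a small made-up data frame 'DF2' (not the data above); as far as
I can tell it also copes with a column in which one of the codes never
occurs:

```r
# Hypothetical data; X2 deliberately contains no "2"
DF2 <- data.frame(X1 = factor(c(0, 1, 2, 2, 0)),
                  X2 = factor(c(0, 0, 1, 1, 0)))

# New level name = old level(s) to be merged into it
recode <- list("0" = "0", "1" = c("1", "2"))

DF2[] <- lapply(DF2, function(x) {
  levels(x) <- recode   # old levels missing from x are skipped
  x
})
```

With the positional form c("0", "1", "1"), a column lacking one of the
three codes would have only two levels and the assignment would fail,
so the list form may be the safer choice for real data.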
HTH,
Henric