[R] multi-column factor

Rui Barradas ruipbarradas at sapo.pt
Sun Sep 16 19:26:47 CEST 2012


Hello,

The obvious simplification is to call union() only once; with 10M rows
that should save time.
I then wondered whether unique() on the concatenated columns would be
faster still.


f1 <- function(x){
     x[[1]] <- factor(x[[1]], levels = union(x[[1]], x[[2]]))
     x[[2]] <- factor(x[[2]], levels = union(x[[1]], x[[2]]))
     x
}

f2 <- function(x){
     levels <- union(x[[1]], x[[2]])
     x[[1]] <- factor(x[[1]], levels = levels)
     x[[2]] <- factor(x[[2]], levels = levels)
     x
}

f3 <- function(x){
     levels <- unique(c(x[[1]], x[[2]]))
     x[[1]] <- factor(x[[1]], levels = levels)
     x[[2]] <- factor(x[[2]], levels = levels)
     x
}

set.seed(5467)
n <- 1e7
z <- data.frame(a = sample(letters[1:3], n, TRUE),
     b = sample(letters[2:4], n, TRUE),
     stringsAsFactors=FALSE)

t1 <- system.time(z1 <- f1(z))
t2 <- system.time(z2 <- f2(z))
t3 <- system.time(z3 <- f3(z))

identical(z1, z2) #[1] TRUE
identical(z1, z3) #[1] TRUE

rbind(t1, t2, t3)
    user.self sys.self elapsed user.child sys.child
t1      2.55     0.47    3.01         NA        NA
t2      1.57     0.29    1.87         NA        NA
t3      1.51     0.26    1.78         NA        NA
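
The f3 approach extends to any number of columns. Below is a minimal sketch, assuming all the chosen columns draw on the same universe; the helper name set_common_levels and its cols argument are made up for illustration and are not part of base R.

```r
# Hypothetical helper: give every column in `cols` the same level set.
set_common_levels <- function(x, cols = names(x)) {
    # Collect the distinct values across all chosen columns, in order of
    # first appearance; as.character() also copes with existing factors.
    lv <- unique(unlist(lapply(x[cols], as.character), use.names = FALSE))
    # Rebuild each chosen column as a factor over the shared levels.
    x[cols] <- lapply(x[cols], factor, levels = lv)
    x
}

z <- data.frame(a = c("a", "b", "c"),
     b = c("b", "c", "d"),
     stringsAsFactors = FALSE)
z4 <- set_common_levels(z)
str(z4)
```

On the small example from the original question this should give both columns the levels "a","b","c","d", the same as f3; I have not timed it on 10M rows.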

Hope this helps,

Rui Barradas

On 16-09-2012 17:46, Sam Steingold wrote:
> I have a data frame with columns which draw on the same underlying
> universe, so I want them to be factors with the same level set:
>
> --8<---------------cut here---------------start------------->8---
>> z <- data.frame(a=c("a","b","c"),b=c("b","c","d"),stringsAsFactors=FALSE)
>> str(z)
> 'data.frame':	3 obs. of  2 variables:
>   $ a: chr  "a" "b" "c"
>   $ b: chr  "b" "c" "d"
>> z$a <- factor(z$a,levels=union(z$a,z$b))
>> z$b <- factor(z$b,levels=union(z$a,z$b))
>> str(z)
> 'data.frame':	3 obs. of  2 variables:
>   $ a: Factor w/ 4 levels "a","b","c","d": 1 2 3
>   $ b: Factor w/ 4 levels "a","b","c","d": 2 3 4
> --8<---------------cut here---------------end--------------->8---
> is factor(z$a,levels=union(z$a,z$b)) the right way to handle this?
> maybe there is a better way to extract levels than union()?
> (bear in mind that I have ~10M rows and ~1M levels, so performance is an
> issue).
>
> Thanks!
>



