[R] Splicing factors without losing levels

Peter Dalgaard p.dalgaard at biostat.ku.dk
Thu Jun 11 10:01:07 CEST 2009


Titus von der Malsburg wrote:
> On Tue, Jun 09, 2009 at 11:23:36AM +0200, ONKELINX, Thierry wrote:
>> For factors, you better convert them first back to character strings.
>>
>>   splice <- function(x, y) {
>> 	x <- levels(x)[x]
>> 	y <- levels(y)[y]
>> 	factor(as.vector(rbind(x, y)))
>>   } 
> 
> Thank you very much, Thierry!
> 
> I failed to mention something important in my last mail: x and y have
> the same levels.  (I assume that the integer to level name mapping of
> a factor defines its class and that it only makes sense to combine
> factors of the same class.)
> 
> Say
> 
>     > x <- factor(c(2,2,4,4), levels=1:4, labels=c("a","b","c","d"))
> 
> then
> 
>     > x
>     [1] b b d d
>     Levels: a b c d
> 
>     > as.integer(x)
>     [1] 2 2 4 4
> 
> but
> 
>     > splice(x,x)
>     [1] b b b b d d d d
>     Levels: b d
> 
>     > as.integer(splice(x,x))
>     [1] 1 1 1 1 2 2 2 2
> 
> I'd like to have a splice function that retains the level to label
> mapping.  One candidate for a solution is:
> 
>     splice <- function(x,y) {
>       xy <- as.vector(rbind(x, y))
>       if (is.factor(x) && is.factor(y))
>         xy <- factor(xy, levels=1:length(levels(x)), labels=levels(x))
>       xy
>     }
> 
> However, this relies on assumptions about the implementation of
> factors that are neither mentioned nor guaranteed in the man page:
> Levels are underlyingly integers starting from one and going to
> length(levels).  levels(x) gives me the labels of these integers in an
> order corresponding to 1:length(levels(x)).
> 
> Without these assumptions I see no way to recover the integer to level
> name mapping for levels that are defined in a factor but do not occur.
> 
> I'd be happy if somebody could clarify this issue!

Hm, well,... Some people have been quite insistent that factors should
be thought of as isomorphic to vectors over small subsets of character
strings and not as isomorphic to small integers with labels. I tend to
disagree as it creates more complications than it solves.

Anyway, I would do it like this (generalizing the "8" and the seq() bits
is left as an exercise):

 > x <- factor(c(2,2,4,4), levels=1:4, labels=c("a","b","c","d"))
 > xx <- factor(rep(NA,8),levels=levels(x))
 > xx[seq(1,8,2)]<-x
 > xx[seq(2,8,2)]<-x
 > xx
[1] b b b b d d d d
Levels: a b c d
 > as.integer(xx)
[1] 2 2 2 2 4 4 4 4
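
One possible way to work the exercise, as a rough sketch rather than a
definitive implementation: wrap the same idea in a function. The function
body and its checks are assumptions added for illustration; in particular
it assumes x and y are factors of equal length with identical levels.

```r
# Hypothetical generalization of the interleaving approach shown above.
# Assumption: x and y are factors of equal length with identical levels.
splice <- function(x, y) {
  stopifnot(is.factor(x), is.factor(y),
            length(x) == length(y),
            identical(levels(x), levels(y)))
  n <- length(x)
  # Start from an all-NA factor that already carries the full level set,
  # so levels that never occur in the data are still retained.
  xy <- factor(rep(NA_character_, 2 * n), levels = levels(x))
  xy[seq(1, by = 2, length.out = n)] <- x   # odd positions take x
  xy[seq(2, by = 2, length.out = n)] <- y   # even positions take y
  xy
}

x  <- factor(c(2, 2, 4, 4), levels = 1:4, labels = c("a", "b", "c", "d"))
xx <- splice(x, x)
levels(xx)       # "a" "b" "c" "d"
as.integer(xx)   # 2 2 2 2 4 4 4 4
```

Assigning through `[<-` on an existing factor keeps its levels attribute,
so nothing here depends on how the integer codes are laid out internally.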


> 
>   Titus
> 


-- 
    O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
   c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
  (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)              FAX: (+45) 35327907



