[Rd] duplicated factor labels.
Paul Johnson
pauljohn32 at gmail.com
Thu Jun 15 02:00:11 CEST 2017
Dear R devel
I've been wondering about this for a while. I am sorry to ask for your
time, but can one of you help me understand this?
This concerns duplicated labels, not levels, in the factor function.
It is hard to understand why factor() fails when given duplicated
labels, while assigning the same labels with levels() afterwards does
not:
> x <- 1:6
> xlevels <- 1:6
> xlabels <- c(1, NA, NA, 4, 4, 4)
> y <- factor(x, levels = xlevels, labels = xlabels)
Error in `levels<-`(`*tmp*`, value = if (nl == nL)
as.character(labels) else paste0(labels, :
factor level [3] is duplicated
> y <- factor(x, levels = xlevels)
> levels(y) <- xlabels
> y
[1] 1 <NA> <NA> 4 4 4
Levels: 1 4
If the latter use of levels() gives a good, expected result, couldn't
factor(..., labels = xlabels) be made to do the same thing?
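For reference, the two-step form can be wrapped in a small helper. This is only a sketch of the workaround shown above; the name factor_relabel is made up for illustration and is not part of base R.

```r
## Hypothetical helper (not in base R): build the factor first, then
## let `levels<-.factor` merge the duplicated and NA labels.
factor_relabel <- function(x, levels, labels) {
    f <- factor(x, levels = levels)
    levels(f) <- labels
    f
}

x <- 1:6
y <- factor_relabel(x, levels = 1:6, labels = c(1, NA, NA, 4, 4, 4))
y
```

It reproduces the second transcript above: the levels collapse to "1" and "4", and the values labelled NA become missing.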
That's the gist of it. To signal to you that I've been trying to
figure this out on my own, here is a revision I've tested in R's
factor function which "seems" to fix the matter. (Of course, it probably
causes lots of other problems I don't understand; that's why I'm
writing to you now.)
In the factor function, the class of f is assigned *after* levels(f) is called:

    levels(f) <- ## nl == nL or 1
        if (nl == nL) as.character(labels)
        else paste0(labels, seq_along(levels))
    class(f) <- c(if(ordered) "ordered", "factor")
At that point, f is still a plain integer vector with no class, so the
assignment levels(f) <- ... is handled by the primitive `levels<-`:
> `levels<-`
function (x, value) .Primitive("levels<-")
That's what generates the error. I don't fully understand what
.Primitive means here, so I will step past that detail.
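A small sketch of what can be checked from the console: `levels<-` is implemented in C (that is what .Primitive indicates), but it is also an internal generic, so it dispatches on the class attribute of its argument. The object names m and f below are only for illustration.

```r
## `levels<-` itself is a primitive (implemented in C).
is.primitive(`levels<-`)

## The factor method is a separate function that can be looked up.
m <- utils::getS3method("levels<-", "factor")
is.function(m)

## A bare integer vector, like f inside factor() before class(f) is
## set, has no class attribute, so no method is selected and the
## primitive's own code runs.
f <- match(as.character(1:6), as.character(1:6))
is.null(attr(f, "class"))
```

So which code runs depends only on whether the class attribute is already set when levels(f) <- ... is evaluated.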
Suppose I revise the factor function to put the class(f) line before
the levels(f) assignment. Then `levels<-.factor` is called and all
seems well.
factor <- function (x = character(), levels, labels = levels, exclude = NA,
                    ordered = is.ordered(x), nmax = NA)
{
    if (is.null(x))
        x <- character()
    nx <- names(x)
    if (missing(levels)) {
        y <- unique(x, nmax = nmax)
        ind <- sort.list(y)
        levels <- unique(as.character(y)[ind])
    }
    force(ordered)
    if (!is.character(x))
        x <- as.character(x)
    levels <- levels[is.na(match(levels, exclude))]
    f <- match(x, levels)
    if (!is.null(nx))
        names(f) <- nx
    nl <- length(labels)
    nL <- length(levels)
    if (!any(nl == c(1L, nL)))
        stop(gettextf("invalid 'labels'; length %d should be 1 or %d",
                      nl, nL), domain = NA)
    ## class() moved up 3 rows
    class(f) <- c(if (ordered) "ordered", "factor")
    levels(f) <- if (nl == nL)
                     as.character(labels)
                 else paste0(labels, seq_along(levels))
    f
}
> assignInNamespace("factor", factor, "base")
> x <- 1:6
> xlevels <- 1:6
> xlabels <- c(1, NA, NA, 4, 4, 4)
> y <- factor(x, levels = xlevels, labels = xlabels)
> y
[1] 1 <NA> <NA> 4 4 4
Levels: 1 4
> attributes(y)
$class
[1] "factor"
$levels
[1] "1" "4"
That's a "good" answer for me.
But I broke your function. I eliminated the check for duplicated levels.
> y <- factor(x, levels = c(1, 1, 1, 2, 2, 2), labels = xlabels)
> y
[1] 1 4 <NA> <NA> <NA> <NA>
Levels: 1 4
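The odd-looking codes in that result come from match(), which returns only the first position of each value in the table, so positions 2, 3, 5 and 6 of the duplicated levels can never be selected. A minimal illustration with the same values (the name codes is just for this example):

```r
## match() reports the first matching position only, so with
## duplicated levels only positions 1 and 4 are ever returned.
codes <- match(as.character(1:6), c(1, 1, 1, 2, 2, 2))
codes
```

Those positions are then relabelled through xlabels, giving "1" and "4" for the first two elements and NA for the rest.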
Rather than have factor return the "duplicated levels" error when
there are duplicated values in labels, I wonder why it is not better
to have a check for duplicated levels directly. For example, insert a
new else in this stanza
    if (missing(levels)) {
        y <- unique(x, nmax = nmax)
        ind <- sort.list(y)
        levels <- unique(as.character(y)[ind])
    } ## next is new part
    else {
        levels <- unique(levels)
    }
That will cause an error when there are duplicated levels because
there are more labels than levels:
> y <- factor(x, levels = c(1, 1, 1, 2, 2, 2), labels = xlabels)
Error in factor(x, levels = c(1, 1, 1, 2, 2, 2), labels = xlabels) :
invalid 'labels'; length 6 should be 1 or 2
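An alternative that reports the problem more directly would be an explicit test on levels. The following is only a sketch: check_levels is a hypothetical name, and nothing like it is claimed to exist in base R's factor().

```r
## Hypothetical pre-check, not part of base R's factor():
## stop with a message that names the first duplicated level.
check_levels <- function(levels) {
    dup <- anyDuplicated(levels)
    if (dup > 0L)
        stop(gettextf("level [%d] is duplicated", dup), domain = NA)
    invisible(levels)
}

ok  <- tryCatch({ check_levels(1:6); TRUE },
                error = function(e) FALSE)
bad <- tryCatch({ check_levels(c(1, 1, 1, 2, 2, 2)); TRUE },
                error = function(e) FALSE)
```

With a check like this, the error would mention the levels themselves rather than the length of labels.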
So, in conclusion, if levels() can accept duplicated labels after a
factor has been created, I wish the equivalent labels argument to
factor() would be accepted too. What is your opinion?
pj
--
Paul E. Johnson http://pj.freefaculty.org
Director, Center for Research Methods and Data Analysis http://crmda.ku.edu
To write to me directly, please address me at pauljohn at ku.edu.