[Rd] duplicated factor labels.
Martin Maechler
maechler at stat.math.ethz.ch
Fri Jun 23 10:42:30 CEST 2017
>>>>> Martin Maechler <maechler at stat.math.ethz.ch>
>>>>> on Thu, 22 Jun 2017 11:43:59 +0200 writes:
>>>>> Paul Johnson <pauljohn32 at gmail.com>
>>>>> on Fri, 16 Jun 2017 11:02:34 -0500 writes:
>> On Fri, Jun 16, 2017 at 2:35 AM, Joris Meys <jorismeys at gmail.com> wrote:
>>> To extend on Martin's explanation:
>>>
>>> In factor(), levels are the unique input values and labels the unique output
>>> values. So the function levels() actually displays the labels.
>>>
>> Dear Joris
>> I think we agree. Currently, factor insists both levels and labels be unique.
>> I wish that it would accept nonunique labels. I also understand it
>> is impractical to change this now in base R.
>> I don't think I succeeded in explaining why this would be nicer.
>> Here's another example. Fairly often, we see input data like
>> x <- c("Male", "Man", "male", "Man", "Female")
>> The first four represent the same value. I'd like to go in one step
>> to a new factor variable with enumerated types "Male" and "Female".
>> This fails
>> xf <- factor(x, levels = c("Male", "Man", "male", "Female"),
>> labels = c("Male", "Male", "Male", "Female"))
>> Instead, we need 2 steps.
>> xf <- factor(x, levels = c("Male", "Man", "male", "Female"))
>> levels(xf) <- c("Male", "Male", "Male", "Female")
>> I think it is quirky that `levels<-.factor` allows the duplicated
>> labels, whereas factor does not.
>> I wrote a function rockchalk::combineLevels to simplify combining
>> levels, but most of the students here like plyr::mapvalues to do it.
>> The use of levels() can be tricky because one must enumerate all
>> values, not just the ones being changed.
>> But I do understand Martin's point. It's been this way 25 years, it
>> won't change. :).
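A minimal sketch of the two-step workaround described above, using the same example vector. The second variant is only an illustration (the object name `xg` is made up here): it relabels just the levels that change by indexing into `levels()`, so the untouched levels need not be retyped.

```r
# Input with several spellings of the same category (from the example above)
x <- c("Male", "Man", "male", "Man", "Female")

# Step 1: build the factor with one level per distinct spelling
xf <- factor(x, levels = c("Male", "Man", "male", "Female"))

# Step 2: assign labels; `levels<-` merges levels that receive the same label
levels(xf) <- c("Male", "Male", "Male", "Female")

# Variant: rename only the levels being changed, leave the others as they are
xg <- factor(x, levels = c("Male", "Man", "male", "Female"))
levels(xg)[levels(xg) %in% c("Man", "male")] <- "Male"

print(levels(xf))
print(levels(xg))
```

Both variants rely on `levels<-` collapsing duplicated labels, which is exactly the behaviour that `factor()` itself rejected at the time of this thread.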
> Well.. the above is a bit out of context.
> Your first example really did not make a point to me (and Joris)
> and I showed that you could use even two different simple factor() calls to
> produce what you wanted
> yc <- factor(c("1",NA,NA,"4","4","4"))
> yn <- factor(c( 1, NA,NA, 4, 4, 4))
> Your new example is indeed much more convincing !
> (Note though that the two steps that are needed can be written
> more shortly.)
> The "been this way 25 years" is a reason to be very
> cautious(*) with changes, but not a reason for no changes!
> (*) Indeed as some of you have noted we really should not "break behavior".
> This means to me we cannot accept a change there which gives
> an error or a different result in cases the old behavior gave a valid factor.
> I'm looking at a possible change currently
> [not promising that a change will happen ...]
In the end, I liked the change (after 2-3 iterations), and
have now been brave enough to commit it to R-devel (svn 72845).
With the change, I had to disable one of our own regression
checks (tests/reg-tests-1b.R, line 726):
The following is now (in R-devel -> R 3.5.0) valid:
> factor(1:2, labels = c("A","A"))
[1] A A
Levels: A
>
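As a sketch of what this means for the "Male"/"Female" example quoted earlier in the thread: assuming an R version that includes this change (R-devel at the time, R >= 3.5.0), the one-step call with duplicated labels should produce the merged factor directly. On older versions the same call stops with an error.

```r
# Assumes R >= 3.5.0, where factor() accepts duplicated labels
x <- c("Male", "Man", "male", "Man", "Female")

# One step: several input levels are mapped onto the same output label
xf <- factor(x, levels = c("Male", "Man", "male", "Female"),
             labels = c("Male", "Male", "Male", "Female"))

print(levels(xf))
print(table(xf))
```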
I wonder how many CRAN package checks will "break" from
this (my guess is on the order of a dozen), but I hope
that these breakages will be benign, e.g., similar to the above
case where an error was previously expected via tools::assertError(.)
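For a package check that used to expect this error, one possible adjustment is to branch on the running R version. This is only a sketch: the helper name `dup_label_call` is hypothetical, and the version cutoff assumes the change ships in R 3.5.0 as stated above.

```r
# Hypothetical helper wrapping the call whose behaviour changed
dup_label_call <- function() factor(1:2, labels = c("A", "A"))

if (getRversion() >= "3.5.0") {
  # Newer R: the call succeeds and yields a one-level factor
  f <- dup_label_call()
  stopifnot(is.factor(f), identical(levels(f), "A"))
} else {
  # Older R: the call is expected to signal an error
  tools::assertError(dup_label_call())
}
```

Keeping both branches lets the same test file pass before and after the change, rather than simply deleting the old expectation.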
Martin