[Rd] problem in levels<- and other inconsistencies
Hervé Pagès
hpages at fredhutch.org
Tue Sep 27 23:20:50 CEST 2016
Hi,
I totally agree that having foo(x) <- foo(x) behave like a no-op
is a must. This is something I try to be careful about when I design
my own objects and their getters and setters.
Just wanted to mention, though, that there is a notorious violation of
this:
x <- list(3:-1, NULL)
x[[2]] <- x[[2]]
x
# [[1]]
# [1] 3 2 1 0 -1
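A minimal sketch of why this happens, for anyone following along: the
right-hand side evaluates to NULL, and assigning NULL with [[<- deletes
the element instead of storing it. The names y and z below are only for
illustration, and the single-bracket form is one possible way to get a
true no-op, not the only one.

```r
# Starting point: a list whose second element is NULL.
x <- list(3:-1, NULL)
length(x)            # 2

# x[[2]] is NULL, so this is the same as y[[2]] <- NULL,
# which removes the element rather than re-storing it.
y <- x
y[[2]] <- y[[2]]
length(y)            # 1

# Single-bracket assignment of a list wrapping NULL keeps the slot,
# so the round trip leaves the list unchanged.
z <- x
z[2] <- x[2]
identical(z, x)      # TRUE

# Equivalent explicit form.
z[2] <- list(NULL)
length(z)            # 2
```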
Of course, the fact that there is a precedent doesn't mean the factor
API shouldn't be improved.
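In the meantime, here is a rough sketch of how one might carry an NA
level over to new data without going through levels<-. It reuses x and f
from the quoted report below. The names g2 and h2 are made up for this
example, and passing exclude=NULL together with explicit levels= is just
one workaround, not a statement about what the API should do.

```r
x <- c("a", "b", NA)

# Factor with an explicit NA level, as in the quoted report.
f <- factor(x, exclude = NULL)

# Re-applying the levels of f: pass exclude = NULL as well, so that
# the NA entry in levels= is not filtered out by the default exclude = NA.
g2 <- factor(x, levels = levels(f), exclude = NULL)
levels(g2)
as.integer(g2)       # the NA element gets its own code

# addNA() is another way to get an NA level on an existing factor.
h2 <- addNA(factor(x))
levels(h2)

# With the NA level in place, split() keeps the NA group.
split(x, g2)
```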
Cheers,
H.
On 09/27/2016 12:33 PM, Dr. Jens Oehlschlägel wrote:
> # A couple of years ago
> # I helped making R's character NA handling more consistent
> # Today I report an issue with R's factor NA handling
> # The core problem is that
> # levels(g) <- levels(g)
> # can change the levels of g
> # more details below
> # Kind regards
> # Jens Oehlschlägel
>
> # Say I have an NA element in a vector or list
>
> x <- c("a","b",NA)
>
> # then using split() it gets lost
>
> split(x, x)
>
> # as it is (somewhat) when converting to a default factor
>
> table(as.factor(x))
>
> # for table the workaround is
>
> table(as.factor(x), exclude=NULL)
>
> # but for split we need
>
> f <- factor(x, exclude=NULL)
>
> split(x, f)
>
> # conclusion: we MUST use an NA level
>
> # so far so good
>
> g <- f
> levels(g)
>
> # but re-assigning the levels changes them
>
> levels(g) <- levels(g)
> levels(g)
>
> # which I consider a severe problem.
> # Yes, I read the help page of levels<-
> # about removing levels by assigning NAs to them
> # but that implies: we MUST NOT use an NA level
>
> # If a language suggests
> # that we MUST and we MUST NOT use an NA level
> # the language has limited usefulness
> # (and a user who depends on the language
> # is put into a DOUBLE BIND)
> # SUGGESTION: assure the above assignment does not change levels
>
> # trying to apply the levels of f to new data also fails
>
> g <- factor(x, levels=levels(f))
> g
>
> # and giving both arguments even stops
>
> h <- factor(x, levels=levels(f), labels=levels(f))
>
> # I do understand that exclude= meaningfully has effect
> # if levels= are to be determined automatically, but
> # SUGGESTION: with explicit levels= exclude= should be ignored.
>
> # SUGGESTION: give split(x, y, exclude=NA) an exclude= argument,
> # which when set to NULL will prevent dropping NA levels
> # when coercing y to factor
> # (it still remains open what should have priority
> # if y is a factor with an NA-level and exclude=NA)
>
> table(f, exclude=NA)
>
> # here existing levels win over exclude=
> # which is consistent with my suggestion for factor(, levels=, exclude=)
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fredhutch.org
Phone: (206) 667-5791
Fax: (206) 667-1319