[R] Handling of factors

Wed Jan 21 04:14:52 CET 2009

I'm rather confused by the semantics of factors.

When applied to factors, some functions (whose results are elements of
the original factor argument) return results of class factor, some
return integer vectors, some return character vectors, some give
errors.  I understand some but not all of this.  Consider:

Preserve factors: `[`, `[[`, sort, unique, subset, head, tapply, rep,
      rev, by, sample, expand.grid,
      as.matrix(structure(factor(1:3),dim=c(1,3))), data.frame, list
Convert to integers: c, ifelse, cbind/rbind
Convert to characters: intersect, union, setdiff, matrix, array,
      matrix(factor(1:3),1,3), as.matrix(factor(1:3))
Gives error: rle
No error (output of some other type): <, ==, etc.
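
A classification like the one above can be reproduced by experiment. The following is only a minimal sketch: kind() is a hypothetical helper (not part of R) that reports "factor", the storage type of a non-factor result, or "error". Several of these results depend on the R version in use, so no output is shown.

```r
# Hypothetical helper: classify the result of an expression.
# The argument is evaluated lazily inside tryCatch(), so errors are caught.
kind <- function(expr) {
  tryCatch({
    x <- expr
    if (is.factor(x)) "factor" else typeof(x)
  }, error = function(e) "error")
}

f <- factor(c("b", "a", "c", "a"))

probe <- c(
  index  = kind(f[2:3]),
  sort   = kind(sort(f)),
  unique = kind(unique(f)),
  concat = kind(c(f, f)),
  ifelse = kind(ifelse(f == "a", f, f)),
  union  = kind(union(f, f)),
  matrix = kind(matrix(f, 2, 2)),
  rle    = kind(rle(f)),
  equals = kind(f == "a")
)
print(probe)
```

Running the same probe under different R releases is a quick way to see which entries in the lists have changed.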

In the case of ordered factors:

Preserve factors: quantile (for exact quantiles only)
Gives error: min, cut, range
No error: which.min, pmin, rank
(But some operations which are meaningful only on ordered factors also
give results on unordered factors, without even a warning: which.min,
pmin, rank, quantile.)

The general principle seems to be that if the result can contain only
elements of a single factor, then a factor is returned.  I understand
this: it may not be meaningful to mingle factors with different level
sets.  But I don't understand what the problem is with rle.
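
For the rle case, a possible workaround (a sketch, not a claim about how rle ought to behave) is to run rle on the labels and rebuild a factor for the run values afterwards. The names runs, run_values and run_lengths are illustrative only.

```r
f <- factor(c("a", "a", "b", "b", "b", "a"))

# rle() accepts a plain character vector, so encode the labels
runs <- rle(as.character(f))

# Restore the factor structure of the run values, keeping the original levels
run_values  <- factor(runs$values, levels = levels(f))
run_lengths <- runs$lengths

print(data.frame(value = run_values, length = run_lengths))
```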

If the result can contain elements from more than one factor, it is
still not clear to me what principle determines whether the factors
are converted to the integers representing them, or to the characters
naming them, or whether the operation gives an error.
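
The difference between the two conversions matters most when the labels look like numbers. A small sketch, assuming two factors f and g with different level sets (both made up for illustration), shows the codes, the labels, and one explicit way to combine the factors by label:

```r
f <- factor(c("10", "20", "10"))
g <- factor(c("20", "30"))

codes  <- as.integer(f)     # internal codes: positions in levels(f)
labels <- as.character(f)   # the level names themselves

# Combining by label, then re-encoding, avoids mixing codes that
# refer to different level sets
combined <- factor(c(as.character(f), as.character(g)))

print(codes)
print(labels)
print(levels(combined))
```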

I also don't understand what is going on with min.  min is well-defined
for any class supporting a < operator, yet although < works on ordered
factors, as do pmin, rank, etc., min does not.  Equally strangely,
which.min and rank blithely convert *un*ordered factors to the
integers which happen to represent them, returning what are presumably
meaningless results without giving an error, while pmin appropriately
gives an error.
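
Whatever min itself does in a given R version, a minimum that is defined only for ordered factors can be written explicitly. This is a sketch; ordered_min() is a hypothetical helper, not an existing R function, and sizes is made-up data.

```r
# Hypothetical helper: minimum of an ordered factor, refusing unordered input
ordered_min <- function(x) {
  stopifnot(is.ordered(x))   # the level order is meaningful only here
  x[which.min(as.integer(x))]
}

sizes <- factor(c("large", "small", "medium"),
                levels = c("small", "medium", "large"),
                ordered = TRUE)

smallest <- ordered_min(sizes)
print(smallest)

# An unordered factor is rejected instead of being silently ranked by code
rejected <- inherits(try(ordered_min(factor(c("x", "y"))), silent = TRUE),
                     "try-error")
print(rejected)
```

The result keeps the class and levels of the input, because it is produced by indexing the factor rather than by returning the integer code.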

It is all very confusing.  Of course, most of this behavior is
documented and is easily determined by experimentation, but it would
be easier to learn and teach the language if there were some clear
principle underlying all this.  What am I missing?

              -s