[R] Problem with factor state when subset()ing a data.frame

Peter Dalgaard P.Dalgaard at biostat.ku.dk
Fri Feb 9 14:24:35 CET 2007

Roger Leigh wrote:
> Hi folks,
> I am running into a problem when calling subset() on a large
> data.frame.  One of the columns contains strings which are used as
> factors.  R seems to automatically factor the column when the
> data.frame is constructed, and this appears to not get updated when I
> create a subset of the table.
> A minimal testcase to demonstrate the problem follows:
> [snip]
> Am I doing something wrong here, or is this an R bug?  
Not really, and no.

This has been discussed a number of times in the past, and the consensus
(grudgingly by some) seems to be that R's current behaviour is the
rational one. The basic issue is whether the fact that a factor level is
absent in a subgroup should change the level set. I.e., if you split a
population by occupation, should the fact that there are no women in the
subgroup of firefighters turn gender into a one-level factor for that
group?  Sometimes it is sensible, but often it is not: If you do a
series of barplots of the gender distribution, should they not have an
empty bar for females when there are none? Similarly, if you have a
semiquantitative scale like terrible-poor-mediocre-good-excellent would
you not prefer to have tables and plots represent all five possible
values always?
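
To make the behaviour concrete, here is a minimal sketch on made-up data
(the data frame d and its columns are hypothetical, not taken from the
original test case). The columns are built with factor() explicitly, so
the result does not depend on how data.frame() treats strings.

```r
# Hypothetical toy data: occupation and gender as explicit factors
d <- data.frame(
  occupation = factor(c("nurse", "nurse", "firefighter", "firefighter")),
  gender     = factor(c("F", "M", "M", "M"))
)

# Keep only the firefighters; none of them is female
ff <- subset(d, occupation == "firefighter")

# The level set of gender is unchanged by subsetting
levels(ff$gender)

# so tabulating (or plotting) still reports "F", with a count of zero
table(ff$gender)
```

The zero count for "F" is exactly the "empty bar" discussed above.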

> How can I get
> the new data.frame to update the factors, so I don't get redundant
> "empty" factors on the plot by eliminating the "phantom" factors?  (I
> also need to remove the unused factors for other analyses, and
> factoring them "by hand" seems a little redundant.)
You already know how (it's not redundant, since you might not want to do
it). I don't think there's an easier way, but you can automate it, as in

sb <- subset(.....)
# find the factor columns of the subset
isf <- sapply(sb, is.factor)
# re-create each one so that unused levels are dropped
sb[isf] <- lapply(sb[isf], factor)
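
As a runnable illustration of the same idiom, the sketch below uses a
hypothetical data frame d with one factor column (grp) and one numeric
column (score); neither name comes from the original post. It also shows
droplevels(), which was added to base R in version 2.12.0, well after
this message was written, and should give the same level sets here.

```r
# Hypothetical toy data: one factor column, one numeric column
d <- data.frame(
  grp   = factor(c("a", "b", "c", "a")),
  score = c(1.5, 2.0, 3.5, 4.0)
)

sb <- subset(d, grp != "c")
levels(sb$grp)                      # "c" is still listed as a level

# Re-factor only the factor columns; other columns are left alone
isf <- sapply(sb, is.factor)
sb[isf] <- lapply(sb[isf], factor)
levels(sb$grp)                      # the unused level is gone

# Assumes R >= 2.12.0: droplevels() does the same for a whole data frame
sb2 <- droplevels(subset(d, grp != "c"))
levels(sb2$grp)
```

Whether to drop the levels at all remains a per-analysis decision, which
is why neither subset() nor `[` does it automatically.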

   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark          Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)                  FAX: (+45) 35327907
