[R] Why do data frame column types vary across apply, lapply?
Erik Iverson
eriki at ccbr.umn.edu
Fri Apr 30 17:45:18 CEST 2010
>
> I still have little ability to predict how these functions will treat the
> columns of data frames:
All of this is explained by knowing what class of data functions *work
on*, and what class of data *you have*.
>
>> # Here's a data frame with a column "a" of integers,
>> # and a column "b" of characters:
>> df <- data.frame(
> + a = 1:2,
> + b = c("a","b")
> + )
>> df
>   a b
> 1 1 a
> 2 2 b
First, let's see what we actually have, using str():
str(df)
'data.frame': 2 obs. of 2 variables:
$ a: int 1 2
$ b: Factor w/ 2 levels "a","b": 1 2
So we have a data.frame with two variables, one of class integer and one
of class factor. Notice that neither is of class character.
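A minimal sketch for reproducing this check. It assumes a current R, where
data.frame() keeps strings as character by default (since R 4.0.0), so
stringsAsFactors = TRUE is passed explicitly to get the same factor column as
above. The name col_classes is just for illustration.

```r
# Build the data frame from the original post.
# Assumption: stringsAsFactors = TRUE is set explicitly, because R >= 4.0.0
# no longer converts strings to factors by default.
df <- data.frame(
  a = 1:2,
  b = c("a", "b"),
  stringsAsFactors = TRUE
)

str(df)                           # compact view of every column's class

col_classes <- sapply(df, class)  # named character vector, one entry per column
print(col_classes)
```

Without the stringsAsFactors argument on a recent R, column b would be
character instead, and the later results for typeof and class would differ.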
>> # Except -- both columns are characters:
>> apply (df, 2, typeof)
>           a           b
> "character" "character"
See ?apply. The apply function works on *matrices*. You're not passing
it a matrix; you're passing a data.frame. Matrices are two-dimensional
vectors and are of *ONE* type. So apply could either
1) report an error saying "give me a matrix"
or
2) try to convert whatever you gave it to a matrix.
apply does (2), and converts the data frame to the best thing it can, a
character matrix. It can't be a numeric matrix since you have mixed types
of data, so it goes to the "lowest common denominator", a matrix of
characters. This is all explained in the Details section of ?apply.
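To see the conversion for yourself, here is a small sketch that does the
coercion by hand with as.matrix(), which is what ?apply says it uses for a
data frame. The variable names are only illustrative, and stringsAsFactors =
TRUE is again an assumption needed on R >= 4.0.0.

```r
# Assumption: stringsAsFactors = TRUE recreates the factor column on R >= 4.0.0.
df <- data.frame(a = 1:2, b = c("a", "b"), stringsAsFactors = TRUE)

# The same coercion apply() performs on a two-dimensional data frame.
m <- as.matrix(df)
matrix_type <- typeof(m)            # one type for the whole matrix

# Each column that apply() hands to typeof is a slice of that matrix.
via_apply <- apply(df, 2, typeof)

# With only the integer column, the matrix can stay numeric.
numeric_only <- is.numeric(as.matrix(df["a"]))

print(matrix_type)
print(via_apply)
print(numeric_only)
```

So the "character" answer describes the temporary matrix, not the columns of
df itself.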
>> # Except -- they're both integers:
>> lapply (df, typeof)
> $a
> [1] "integer"
>
> $b
> [1] "integer"
typeof is probably not very useful for casual R use; I've never used
it. More useful is class (see ?class). typeof shows you how R is storing
this stuff at a low level. Factors are just integer codes with labels, and
you have an integer variable and a factor variable, so typeof reports
"integer" for both.
Try lapply(df, class)
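A short sketch of the difference, again assuming stringsAsFactors = TRUE so
that b is a factor on a current R. It uses sapply() rather than lapply() only
to get a compact named vector; the variable names are illustrative.

```r
# Assumption: stringsAsFactors = TRUE recreates the factor column on R >= 4.0.0.
df <- data.frame(a = 1:2, b = c("a", "b"), stringsAsFactors = TRUE)

storage <- sapply(df, typeof)   # how R stores each column internally
classes <- sapply(df, class)    # what kind of object each column is

# A factor is an integer vector of codes plus a "levels" attribute.
codes  <- as.integer(df$b)      # the underlying integer codes
labels <- levels(df$b)          # the labels those codes point to

print(storage)
print(classes)
print(codes)
print(labels)
```

Both columns are stored as integers, but only class tells you that b is a
factor and should be treated as categorical data.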
>
>> # Except -- only one of those integers is numeric:
>> lapply (df, is.numeric)
> $a
> [1] TRUE
>
> $b
> [1] FALSE
Yes, because you have a factor. In the first 3 paragraphs of
?is.numeric (the same help page as ?as.numeric), you'd see:
    Factors are handled by the default method, and there
    are methods for classes ‘"Date"’ and ‘"POSIXt"’ (in all three
    cases the result is false). Methods for ‘is.numeric’ should only
    return true if the base type of the class is ‘double’ or ‘integer’
    _and_ values can reasonably be regarded as numeric (e.g.
    arithmetic on them makes sense).
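A sketch of why treating a factor as numeric would be misleading. The factor
f below is a hypothetical example, not from the original post, and
stringsAsFactors = TRUE is assumed for df as before.

```r
# Assumption: stringsAsFactors = TRUE recreates the factor column on R >= 4.0.0.
df <- data.frame(a = 1:2, b = c("a", "b"), stringsAsFactors = TRUE)

numeric_cols <- sapply(df, is.numeric)    # integer column vs. factor column

# Hypothetical factor whose labels happen to look like numbers.
f <- factor(c("10", "20"))

codes  <- as.numeric(f)                   # the internal codes, not the labels
values <- as.numeric(as.character(f))     # the numbers the labels spell out

print(numeric_cols)
print(codes)
print(values)
```

The codes are just positions in levels(f), so arithmetic on them would not
mean anything, which is why is.numeric returns FALSE for factors.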
See, it all makes perfect sense :).
My advice? Don't worry about typeof. *Always* know what class your
objects are, and what class the functions you're using expect. Use ?str
liberally.