[R] Why Numeric Values Become Factors in Data Frame
Marc Schwartz
marc_schwartz at me.com
Tue Nov 29 20:40:27 CET 2011
On Nov 29, 2011, at 1:18 PM, Rich Shepard wrote:
> I have a data frame with 1 factor, one date, and 37 numeric values:
> str(waterchem)
> 'data.frame': 3525 obs. of 39 variables:
> $ site : Factor w/ 64 levels "D-1","D-2","D-3",..: 1 1 1 1 1 ...
> $ sampdate : Date, format: "2007-12-12" "2008-03-15" ...
> $ CO3 : num 1 1 6.7 1 1 1 1 1 1 1 ...
> $ HCO3 : num 231 228 118 246 157 208 338 285 260 240 ...
> $ Ca : num 100 88.4 63.4 123 78.2 103 265 213 178 166 ...
> $ DO : num 4.96 9.91 4.32 2.58 1.81 5.09 3.98 5.46 1.9 2.52 ...
> ...
> $ SC : Factor w/ 841 levels "1.090","10.000",..: 635 638 363
>
> All the numeric categories are read in as numbers except for some of those
> in column 'SC'. I have been looking in the source file for a couple of hours
> trying to learn why values such as 1.090 and 10.000 are seen as characters
> rather than numbers. I've not seen the reason.
>
> The source file is 860K and looks like this:
>
> site|sampdate|'Ag'|'Al'|'CO3'|'HCO3'|'Alk-Tot'|'As'|'Ba'|'Be'|'Bi'|'Ca'|'Cd'|'Cl'|'Co'|'Cr'|'Cu'|'DO'|'Fe'|'Hg'|'K'|'Mg'|'Mn'|'Mo'|'Na'|'NH4'|'NO3-NO2'|'Oil-grease'|'Pb'|'pH'|'Sb'|'SC'|'Se'|'SO4'|'Sr'|'TDS'|'Tl'|'V'|'Zn'
> 'D-1'|'2007-12-12'|0.000|0.106|1.000|231.000|231.000|0.011|0.000|0.002|0.000|100.000|0.000|1.430|0.000|0.006|0.024|4.960|4.110|NA|0.000|9.560|0.035|0.000|0.970|0.010|0.293|NA|0.025|7.800|0.001|630.000|0.001|65.800|0.000|320.000|0.001|0.000|11.400
> 'D-1'|'2008-03-15'|0.000|0.080|1.000|228.000|228.000|0.001|0.000|0.002|0.000|88.400|0.000|1.340|0.000|0.006|0.014|9.910|0.309|0.000|0.000|9.150|0.047|0.000|0.820|0.224|0.020|NA|0.025|7.940|0.001|633.000|0.001|75.400|0.000|300.000|0.001|0.000|12.400
>
> The R command used to create the data frame is:
> waterchem <- read.table('wqR.txt', header = TRUE, sep = '|')
>
> Pointers on how to determine why this one variable has some values read as
> characters rather than as numerics are needed.
>
> Rich
Rich,
Somewhere in that column are non-numeric characters (other than 0 through 9 and a decimal point), resulting in the column being coerced to a factor.
Not fully tested, but using grepl() along the lines of:
> Vec <- c(1.09, 1.23, "1,23", "A", 2.067)
> which(grepl("[^0-9\\.]", Vec))
[1] 3 4
Will give you the indices of the entries in the column that contain non-numeric characters.
> Vec[which(grepl("[^0-9\\.]", Vec))]
[1] "1,23" "A"
Will give you the entries themselves.
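The same kind of check can be run against the column itself. A minimal sketch, assuming a small stand-in for wqR.txt: the toy rows and the stray "630,000" entry are invented for illustration, and stringsAsFactors = TRUE is passed explicitly because R 4.0.0 and later no longer create factors by default. Instead of a character class, it flags entries that are not NA yet fail numeric conversion:

```r
# Toy stand-in for the real file; the malformed "630,000" entry is hypothetical.
txt <- "site|SC
'D-1'|630.000
'D-1'|633.000
'D-2'|630,000
'D-2'|NA"
toy <- read.table(text = txt, header = TRUE, sep = "|",
                  stringsAsFactors = TRUE)

# Work on the character form of the factor, not its integer codes.
sc_chr <- as.character(toy$SC)

# Entries that are not NA yet fail numeric conversion are the culprits.
sc_num <- suppressWarnings(as.numeric(sc_chr))
bad <- which(!is.na(sc_chr) & is.na(sc_num))

bad          # row numbers of the offending entries
sc_chr[bad]  # the offending entries themselves
```

Unlike the character-class test, this also accepts negative values and scientific notation such as 1e-3 as numeric, so it may be the safer choice if the column can legitimately contain either.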
The read.table() family of functions uses type.convert() internally to do the data type coercions:
> type.convert(Vec)
[1] 1.09 1.23 1,23 A 2.067
Levels: 1,23 1.09 1.23 2.067 A
So 'Vec' is coerced to a factor due to the non-numeric characters contained in the entries.
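Once the offending entries are found, the column can be repaired. A rough sketch, again using 'Vec'; the corrections chosen here ("1,23" to "1.23", "A" to NA) are assumptions for illustration only, and as.is = FALSE is given explicitly because recent versions of R expect the caller to specify it:

```r
Vec <- c(1.09, 1.23, "1,23", "A", 2.067)

# as.is = FALSE requests the factor behaviour described above.
f <- type.convert(Vec, as.is = FALSE)
is.factor(f)

# With the two offending entries corrected (assumed fixes, see lead-in),
# the same call yields a numeric vector.
fixed <- Vec
fixed[fixed == "1,23"] <- "1.23"
fixed[fixed == "A"] <- NA
num <- type.convert(fixed, as.is = FALSE)
is.numeric(num)

# Pitfall: as.numeric() on a factor returns the level codes, not the values.
codes <- as.numeric(f)
# Going through as.character() recovers the values, with NA where the
# entry is not a number.
vals <- suppressWarnings(as.numeric(as.character(f)))
```

For the real data frame, the equivalent would be to correct the entries in the source file and re-read it, or to convert the column with as.numeric(as.character(waterchem$SC)) after deciding what the bad entries should become.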
HTH,
Marc Schwartz