[R] Very confused with class
Dan Davison
davison at stats.ox.ac.uk
Thu Aug 21 17:57:54 CEST 2008
On Thu, Aug 21, 2008 at 04:20:57PM +0100, Williams, Robin wrote:
> Hi Dan,
> Thanks for the reply, yes, I am using read.csv on the attached file.
OK, so how about using the colClasses argument? Your problem is that
some malfunctioning software has inserted the value "#VALUE!" into
some of your supposedly numeric cells, so read.csv can no longer treat
those columns as numeric and turns them into factors. Deal with that
using the na.strings argument. Like I said, when reading in data, it's
worth spending a minute looking at the documentation for
read.table/read.csv rather than spending an hour messing about with
the results of not doing so.
> Southwest <- read.csv("southwest.csv", colClasses=c("character",rep("numeric",10), "character"), na.strings="#VALUE!")
> str(Southwest)
'data.frame': 1530 obs. of 12 variables:
$ date : chr "5/1/1997" "5/2/1997" "5/3/1997" "5/4/1997" ...
$ maxtemp : num 18.8 21.8 16.6 14.9 14.2 9.3 9.9 12.7 12.8 13.2 ...
$ mintemp : num 7.7 9.8 11 12.2 11.3 4.5 2.1 5.7 6.7 7.3 ...
$ pressure : num 1028 1023 1015 1001 989 ...
$ humid : num 59 44 83 80 87 57 64 83 70 69 ...
$ wind : num 8.4 11.1 8.2 17.4 13.8 16.2 11.1 14.9 12.7 16.6 ...
$ rain : num 0 0 6 1 3.3 2.6 4.3 6 3.2 1.6 ...
$ index : num 1 2 3 4 5 6 7 8 9 10 ...
$ admissions: num 5.00 4.72 5.16 3.67 3.62 ...
$ detrended : num 4.79 4.47 5.30 3.91 3.51 ...
$ detrended2: num 4.79 4.47 5.30 3.91 3.51 ...
$ d.o.w. : chr "Thu" "Fri" "Sat" "Sun" ...
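To see the effect in isolation, here is a minimal sketch on a made-up
three-row extract (csv_text and its values are hypothetical; the real
southwest.csv is not reproduced here). stringsAsFactors = TRUE is set
explicitly because, as far as I know, R 4.0.0 and later no longer
create factors by default.

```r
# Hypothetical extract standing in for southwest.csv
csv_text <- "date,pressure,rain
5/1/1997,1028,0
5/2/1997,#VALUE!,0
5/3/1997,1015,6"

# Without na.strings, the single "#VALUE!" cell makes the whole column
# non-numeric. stringsAsFactors = TRUE mimics the pre-4.0.0 default.
bad <- read.csv(text = csv_text, stringsAsFactors = TRUE)
class(bad$pressure)    # "factor"

# Declaring the column classes and the NA marker gives a numeric column
# with NA where "#VALUE!" used to be.
good <- read.csv(text = csv_text,
                 colClasses = c("character", "numeric", "numeric"),
                 na.strings = "#VALUE!")
class(good$pressure)   # "numeric"
mean(good$pressure, na.rm = TRUE)
```

Remember na.rm = TRUE (or an explicit decision about the missing rows)
in later summaries, since those cells are now NA.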
NB you could coerce those dates to a date class rather than character
but I'll leave that up to you.
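A possible sketch of that coercion, assuming the dates are
month/day/year (the d.o.w. column running Thu, Fri, Sat, Sun for
consecutive rows suggests so, but check against your own file):

```r
# The first four date strings from the str() output above
dates_chr <- c("5/1/1997", "5/2/1997", "5/3/1997", "5/4/1997")

# Assumed format: month/day/four-digit year
dates <- as.Date(dates_chr, format = "%m/%d/%Y")
str(dates)
diff(dates)   # consecutive days should differ by 1
```

On the full data frame this would be something like
Southwest$date <- as.Date(Southwest$date, format = "%m/%d/%Y").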
str() is your friend.
Dan
> However, when I do
> Southwest <- data.frame(read.csv("southwest.csv"))
read.csv returns a data frame; there is no need to wrap it in data.frame().
> names(Southwest)
> the output is the column headings (i.e. the variables), and looking at
> the data I only get the numbers, I assume the column headings haven't
> become confused with the data.
> I.e. if I just do
> Southwest$pressure
> The output is correct, i.e. the values contained in the pressure column.
>
> Apologies for my repeated question, but I'm somewhat confused on this
> one and my lack of experience with R isn't helping matters. I don't even
> understand why R is interpreting these figures as factors in the first
> place, doesn't this imply that any similar data would be interpreted as
> factors?
> Thanks for any further help.
> Robin Williams
> Met Office summer intern - Health Forecasting
> robin.williams at metoffice.gov.uk
> -----Original Message-----
> From: Dan Davison [mailto:davison at stats.ox.ac.uk]
> Sent: Thursday, August 21, 2008 4:11 PM
> To: Williams, Robin
> Cc: r-help at r-project.org
> Subject: Re: [R] Very confused with class
>
> Hi Robin,
>
> You haven't said where you're getting the data from. But if the answer
> is that you're using read.table, read.csv or similar to read the data
> into R, then I advise you to go back to that stage and get it right from
> the outset. It's very, very common to see people who are relatively new
> to R splattering their code with calls to as.numeric, just because they
> haven't read the data in properly in the first place. It's also common
> in those who aren't new to R... So e.g. if you are using read.table,
> then use the colClasses argument to specify the classes of your columns,
> and use str() on the result until you're happy with the data frame
> produced.
>
> It's not entirely clear why you would have ended up with factors if your
> data are numeric. That often happens when people mix characters with
> numbers. Perhaps you have mixed the header row up with the data?
>
> Anyway, what you are seeing are the integer encodings of the factors.
> E.g.
>
> > f <- factor(11:20)
> > str(f)
> Factor w/ 10 levels "11","12","13",..: 1 2 3 4 5 6 7 8 9 10
> > as.numeric(f)
> [1] 1 2 3 4 5 6 7 8 9 10
>
> But don't mess with them. Just make sure that things which shouldn't be
> factors never become factors.
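For illustration only, a small sketch of why converting a factor
directly gives nonsense such as a maximum pressure of 761. The vector
below is made up; if you are ever stuck with a factor, going through
as.character() recovers the printed values, although reading the data
in correctly remains the better fix.

```r
# Made-up factor mixing pressures with a stray "#VALUE!" entry
f <- factor(c("1028", "1023", "#VALUE!", "1001"))

# Direct conversion returns the integer codes of the levels,
# not the pressures
codes <- as.numeric(f)
codes

# Converting via character recovers the values; "#VALUE!" becomes NA
# (the coercion warning is suppressed here on purpose)
vals <- suppressWarnings(as.numeric(as.character(f)))
vals
```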
>
> Dan
>
> On Thu, Aug 21, 2008 at 03:40:58PM +0100, Williams, Robin wrote:
> > Hi all,
> > I am very confused with class.
> > I am looking at some weather data which I want to use as explanatory
> > variables in an lm. R has treated these variables as factors (i.e.
> > with different levels), whereas I want them treated as discretely
> > measured continuous variables. So I need to reassign the class of
> > these variables, right?
> > Indeed, doing
> > class(southwest$pressure)
> > (pressure being air pressure), I get
> > #> factor.
> > Now what class should I use to reassign them so that my model
> > fitting process goes as I want it to? I have obviously done something
> > wrong. I did southwest$pressure <- as(southwest$pressure,"numeric")
> > numeric seeming like a reasonable class to assign to this variable.
> > However, doing some summary stats like
> > mean(southwest$pressure)
> > #> 341,
> > max(southwest$pressure)
> > #> 761,
> > which is clearly nonsense, as my maximum value is around 1040.
> > Something similar has happened to maxtemp (maximum temperature), which
> > I also reassigned from a factor to class numeric, which now apparently
> > has a maximum value of 147!
> > Clearly it must be the reassignment of class that has caused these
> > problems, as summary stats on the data before I reassigned the classes
> > were fine. What is wrong with the class numeric? Reading the numeric
> > help page didn't reveal anything to me. Can someone suggest the
> > correct class?
> > Many thanks for any help.
> > Robin Williams
> > Met Office summer intern - Health Forecasting
> > robin.williams at metoffice.gov.uk
>
> --
> http://www.stats.ox.ac.uk/~davison
--
http://www.stats.ox.ac.uk/~davison