[R] Still can't find missing data - How do I get NA in xtabs with factors?
Farley, Robert
FarleyR at metro.net
Fri May 29 20:14:10 CEST 2009
Let's see if I understand this. Do I iterate through
x <- factor(x, levels=c(levels(x), NA), exclude=NULL)
for each of the few hundred variables (x) in my data frame?
I tried to do this all at once and failed:
> ToyData
Data1 Data2 Data3 Weight
101 Sam Red Banana 1.1
102 Sam Green Banana 2.1
103 Sam Blue Orange 2.1
104 Fred Red Orange 2.1
105 Fred Green Guava 2.1
106 Fred Blue Guava 2.1
107 <NA> Red Pear 50.1
108 <NA> Green Pear 50.1
109 <NA> Blue <NA> 1000.2
> ToyData <- factor(ToyData, levels(c(levels(ToyData), NA), exclude=NULL, na.action=na.pass))
Error in levels(c(levels(ToyData), NA), exclude = NULL, na.action = na.pass) :
unused argument(s) (exclude = NULL, na.action = function (object, ...)
> ToyData <- factor(ToyData, levels(c(levels(ToyData), NA)))
> ToyData
Data1 Data2 Data3 Weight
<NA> <NA> <NA> <NA>
Levels:
>
But it didn't work. Don't I need to do this separately for each variable?
Is there a way to get read.spss to insert "NA" levels for each variable when I create the data frame? Is this because SPSS (and STATA) allow "NA" as an "undeclared level" and R does not?
Will this be a problem with read.dta as well?
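
For reference, a minimal sketch of the per-column approach asked about above. factor() operates on a single vector, so passing the whole data frame to it cannot work; the call has to be applied to each factor column, which lapply() can do in one step. The data frame below is a hand-typed copy of the toy data with the integer weights from the older message quoted further down, the names is_fac and tab are illustrative only, and levels= is written out as a named argument on the assumption that this is what the suggested call was meant to be.

```r
# Hand-typed copy of the toy data (integer weights, total 1111).
ToyData <- data.frame(
  Data1  = c("Sam", "Sam", "Sam", "Fred", "Fred", "Fred", NA, NA, NA),
  Data2  = c("Red", "Green", "Blue", "Red", "Green", "Blue",
             "Red", "Green", "Blue"),
  Data3  = c("Banana", "Banana", "Orange", "Orange", "Guava", "Guava",
             "Pear", "Pear", NA),
  Weight = c(1, 2, 2, 2, 2, 2, 50, 50, 1000),
  row.names = 101:109,
  stringsAsFactors = TRUE
)

# Find the factor columns and add NA as an explicit level to each one.
is_fac <- vapply(ToyData, is.factor, logical(1))
ToyData[is_fac] <- lapply(ToyData[is_fac], function(x)
  factor(x, levels = c(levels(x), NA), exclude = NULL))

# NA is now an ordinary level, so its rows are no longer dropped.
tab <- xtabs(Weight ~ Data1 + Data2, data = ToyData)
tab
sum(tab)  # should equal sum(ToyData$Weight) if no rows were dropped
```

Columns without any missing values also gain an unused NA level this way, which shows up as an all-zero row or column in the table. addNA(x, ifany = TRUE) from base R is an alternative that adds the level only where NAs actually occur.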
Robert Farley
Metro
www.Metro.net
-----Original Message-----
From: William Dunlap [mailto:wdunlap at tibco.com]
Sent: Thursday, May 28, 2009 20:39
To: Farley, Robert
Subject: RE: [R] Still can't find missing data
In R factors don't save space over character vectors - only
one copy of any given string is kept in memory in either case.
Factors do let you order the levels in the way you want and
that is often important in presentations.
You can add NA to the list of levels of a factor by doing
x <- factor(x, levels=c(levels(x), NA), exclude=NULL)
where 'x' represents each factor in your dataset. After
doing that is.na(x) will be all FALSE and you may not
want that for other situations.
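
A small illustration of that caveat, as a sketch on a made-up vector (x and x2 are hypothetical names, and levels= is spelled out as a named argument, which is assumed to be the intent of the call above). Once NA is a level, the missing entries carry an ordinary integer code, so is.na() no longer flags them.

```r
x <- factor(c("Sam", "Fred", NA, "Sam"))
is.na(x)                     # the third element is reported as missing

x2 <- factor(x, levels = c(levels(x), NA), exclude = NULL)
levels(x2)                   # NA is now one of the levels
is.na(x2)                    # no element is reported as missing

# The missing entries can still be found through the labels.
is.na(as.character(x2))
```

Keeping the original factor alongside the recoded one, or testing the labels as in the last line, is one way to retain the usual missing-value checks.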
Bill Dunlap
TIBCO Software Inc - Spotfire Division
wdunlap tibco.com
> -----Original Message-----
> From: r-help-bounces at r-project.org
> [mailto:r-help-bounces at r-project.org] On Behalf Of Farley, Robert
> Sent: Thursday, May 28, 2009 5:27 PM
> To: R-help
> Subject: Re: [R] Still can't find missing data
>
> That seems to work for the toy data. How do I implement this
> change with my real data, which are read from very large
> Stata and SPSS files and keep the factor definitions? Won't
> I be losing information (and creating a larger dataset) by
> not using the factor levels?
>
>
> How do I recover the factor values? I read my datafile
> (read.spss using use.value.labels = FALSE) and got this:
>
> connector
> Mode_orig_only 1 9
> 1 17.814338 0.000000
> 3 49.128982 0.000000
> 4 525.978899 0.000000
> 5 913.295370 0.000000
> 6 114.302764 0.000000
> 7 298.151438 0.000000
> 8 93.088049 0.000000
> 9 233.794168 0.000000
> 10 20.764539 0.000000
> 11 424.120506 0.000000
> 12 8.054528 0.000000
> 13 6.010790 0.000000
> 14 1832.748525 0.000000
> 15 10191.284139 0.000000
> 16 2099.771923 0.000000
> 17 1630.148576 0.000000
> <NA> 0.000000 9491.013249
>
> which does have the "NA" row, but not the factor labels. If
> I read the file with use.value.labels=TRUE I can see what I'm
> summarizing, but not the NAs. Can't I have both?
>
> The top summary will also omit all 0 value factors (of
> course) in the variable summarized.
>
>
> The same summary using factors:
>
>                                                               connector
> Mode_orig_only                                                OD Passenger  Connector
> Walked/Biked                                                     17.814338   0.000000
> I flew in from another a place/connected                          0.000000   0.000000
> Amtrak                                                           49.128982   0.000000
> Bus - Chartered bus or van                                      525.978899   0.000000
> Bus - Hotel Courtesy van                                        913.295370   0.000000
> Bus - MTA (Metro) or other public transit bus                   114.302764   0.000000
> Bus - Scheduled airport bus or van (e.g. Airport bus or Disn    298.151438   0.000000
> Bus - Union Station Flyaway                                      93.088049   0.000000
> Bus - Van Nuys Flyaway                                          233.794168   0.000000
> Green line/light rail                                            20.764539   0.000000
> Limousine/town car                                              424.120506   0.000000
> Metrolink                                                         8.054528   0.000000
> Motorcycle                                                        6.010790   0.000000
> On-call shuttle/van (e.g. Super Shuttle, Prime Time)           1832.748525   0.000000
> Car/truck/van - Private                                       10191.284139   0.000000
> Car/truck/van - Rental                                         2099.771923   0.000000
> Taxi                                                           1630.148576   0.000000
> ..Refused                                                         0.000000   0.000000
>
> Robert Farley
> Metro
> www.Metro.net
>
>
> -----Original Message-----
> From: William Dunlap [mailto:wdunlap at tibco.com]
> Sent: Thursday, May 28, 2009 16:26
> To: Farley, Robert
> Subject: RE: [R] Still can't find missing data
>
> Try reading it in with read.table's argument stringsAsFactors=FALSE.
>
> I think the underlying problem is that exclude= is used only if
> the classifying variables are not already factors. I haven't studied
> the help file well enough to see if that is what it is documented
> to do, but it seems misleading.
>
> Bill Dunlap
> TIBCO Software Inc - Spotfire Division
> wdunlap tibco.com
>
> > -----Original Message-----
> > From: r-help-bounces at r-project.org
> > [mailto:r-help-bounces at r-project.org] On Behalf Of Farley, Robert
> > Sent: Thursday, May 28, 2009 4:10 PM
> > To: R-help
> > Subject: Re: [R] Still can't find missing data
> >
> > In this toy data, each of the tables should sum to 1111
> > None of the tables shows NA columns or rows.
> >
> >
> > > ################################
> > > ToyData <- read.table("C:/Data/R/Toy.csv", header=TRUE,
> > sep=",", na.strings="NA", dec=".", row.names="ID_Num")
> > > ToyData
> > Data1 Data2 Data3 Weight
> > 101 Sam Red Banana 1
> > 102 Sam Green Banana 2
> > 103 Sam Blue Orange 2
> > 104 Fred Red Orange 2
> > 105 Fred Green Guava 2
> > 106 Fred Blue Guava 2
> > 107 <NA> Red Pear 50
> > 108 <NA> Green Pear 50
> > 109 <NA> Blue <NA> 1000
> > > xtabs(Weight ~ Data1 + Data2, exclude=NULL,
> > na.action=na.pass, ToyData)
> > Data2
> > Data1 Blue Green Red
> > Fred 2 2 2
> > Sam 2 2 1
> > > xtabs(Weight ~ Data1 + Data2, exclude=NULL,
> > na.action=na.pass,drop.unused.levels = FALSE, ToyData)
> > Data2
> > Data1 Blue Green Red
> > Fred 2 2 2
> > Sam 2 2 1
> > > xtabs(Weight ~ Data1 + Data3, exclude=NULL,
> > na.action=na.pass,drop.unused.levels = FALSE, ToyData)
> > Data3
> > Data1 Banana Guava Orange Pear
> > Fred 0 4 2 0
> > Sam 3 0 2 0
> > >
> >
> >
> >
> >
> >
> > Robert Farley
> > Metro
> > www.Metro.net
> >
> >
> > -----Original Message-----
> > From: r-help-bounces at r-project.org
> > [mailto:r-help-bounces at r-project.org] On Behalf Of Dieter Menne
> > Sent: Thursday, May 28, 2009 05:46
> > To: r-help at r-project.org
> > Subject: Re: [R] Still can't find missing data
> >
> >
> >
> >
> > Farley, Robert wrote:
> > >
> > > I can't get the syntax that will allow me to show NA values
> > (rows) in the
> > > xtabs.
> > >
> > > lengthy non-reproducible example removed
> > >
> >
> > If you want a reproducible answer, prepare a reproducible
> > result. And check
> > that the
> > syntax is
> >
> > na.action=na.pass
> >
> > Dieter
> >
> >
> >
> >
> > --
> > View this message in context:
> > http://www.nabble.com/Still-can%27t-find-missing-data-tp237306
> > 27p23761006.html
> > Sent from the R help mailing list archive at Nabble.com.
> >
> > ______________________________________________
> > R-help at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> > http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> >