[R] Subscripting problem with is.na()
William Dunlap
wdunlap at tibco.com
Fri Jun 24 17:42:55 CEST 2016
Is part of the issue that in common parlance "NA" or "N/A" may
mean either "not available" or "not applicable" (e.g., isPregnant
for a male) but in R NA means only "not available"?
Bill Dunlap
TIBCO Software
wdunlap tibco.com
On Fri, Jun 24, 2016 at 8:37 AM, Bert Gunter <bgunter.4567 at gmail.com> wrote:
> As Petr and Don have shown you, changing NA to 0 is unnecessary to get
> what you want. However, recoding to 0 may be OK, as NA has a specific
> meaning in this context, and you are just adding an extra code to a
> factor for a different level.
>
> But it still might cause you trouble later. One of R's strengths is
> its ability to deal with NAs simply -- most of the time, anyway. For
> example, note that you would have to make sure these columns are
> factors (*not numerics*), if you wanted to, say, investigate how
> category of closing related to other covariates via e.g. multinomial
> logistic regression or even just to tabulate the "closed" categories.
> Keeping NA as NA allows R's built-in facilities to simply handle (e.g.
> omit) the data for the "still open" cases, but you will have to do it
> explicitly yourself if you code to 0. That seems to be asking for
> trouble to me.
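
A minimal sketch of the point above, assuming a cut-down version of the closing-code vectors from the original post (the values are made up for illustration). The combined flag can be derived from is.na() directly, so nothing has to be recoded to 0 and the codes stay available for tabulation:

```r
# Hypothetical closing codes; NA means the account is still open.
closed_mdm <- c("01", NA, NA, "08", NA)
closed_sls <- c(NA, "08", NA, "08", NA)

# TRUE if either department recorded a closing code; the result has no NAs.
closed <- !is.na(closed_mdm) | !is.na(closed_sls)
print(closed)

# The original codes are untouched, so tabulating them as a factor
# leaves out the still-open accounts by default ...
print(table(factor(closed_mdm)))

# ... or counts them explicitly when asked to.
print(table(factor(closed_mdm), useNA = "ifany"))
```

The logical vector can then be used as a row filter, while closed_mdm and closed_sls keep their NAs for any later modelling.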
>
> As always, contrary views welcome. This discussion still seems on
> (r-help) topic to me, but if not, please say so.
>
> Cheers,
> Bert
>
> Bert Gunter
>
> "The trouble with having an open mind is that people keep coming along
> and sticking things into it."
> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>
>
> On Fri, Jun 24, 2016 at 12:14 AM, <G.Maubach at gmx.de> wrote:
> > Hi Bert,
> >
> > many thanks for all your help and your comments. I learn a lot this way.
> >
> > My question was about is.na() at first sight, but the actual task
> > looks like this:
> >
> > I have two variables in my customer data that signal whether the customer
> > account was closed by master data management or by sales. Say these
> > variables are closed_mdm and closed_sls. They contain NA if the customer
> > account is still open, or a closing code from "01" to "08" giving the
> > reason if the account was closed.
> >
> > For my analysis I need a variable that combines closed_mdm and closed_sls,
> > so that I can easily filter on the accounts that are closed, no matter
> > what the reason was or who closed the account.
> >
> > As I always run into problems when dealing with ifelse statements and
> > NA, I decided to merge these two variables into one variable containing
> > 0 = not closed and 1 = closed. In my context this seems, at least to me,
> > a reasonable approach.
> >
> > Replacing the missing values and merging the variables is the easiest
> > way for me.
> >
> > -- cut --
> >
> > cust_id <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
> >              18, 19, 20)
> > closed_mdm <- c("01", NA, NA, NA, "08", "07", NA, NA, "05", NA, NA, NA,
> >                 "04", NA, NA, NA, NA, NA, NA, NA)
> > closed_sls <- c(NA, "08", NA, NA, "08", "07", NA, NA, NA, NA, "03", NA,
> >                 NA, NA, "05", NA, NA, NA, NA, NA)
> >
> > # 1st try
> > ds_temp1 <- data.frame(cust_id, closed_mdm, closed_sls)
> > ds_temp1
> >
> > ds_temp1$closed <- closed_mdm | closed_sls # WRONG
> >
> > # 2nd try
> > closed_mdm_fac1 <- as.factor(closed_mdm)
> > closed_sls_fac1 <- as.factor(closed_sls)
> >
> > ds_temp2 <- data.frame(cust_id, closed_mdm_fac1, closed_sls_fac1)
> > ds_temp2
> >
> > ds_temp2$closed <- ds_temp2$closed_mdm_fac1 | ds_temp2$closed_sls_fac1 # WRONG
> >
> > # 3rd try
> > closed_mdm_num1 <- as.numeric(closed_mdm) # OK
> > closed_sls_num1 <- as.numeric(closed_sls) # OK
> >
> > ds_temp3 <- data.frame(cust_id, closed_mdm_num1, closed_sls_num1)
> > ds_temp3
> >
> > ds_temp3$closed <- ds_temp3$closed_mdm_num1 | ds_temp3$closed_sls_num1 # WRONG
> >
> > # 4th try
> > ds_temp4 <- ds_temp3
> > ds_temp4
> >
> > # Does not run due to not allowed NA in subscripts
> > ds_temp4[is.na(ds_temp4$closed_mdm_num1), ds_temp4$closed_mdm_num1] <- 0
> > ds_temp4[is.na(ds_temp4$closed_sls_num1), ds_temp4$closed_sls_num1] <- 0
> >
> > # 5th try
> > # NA (still open) becomes 0, any closing code becomes 1
> > ds_temp4$closed_mdm_num1 <- ifelse(is.na(ds_temp4$closed_mdm_num1), 0, 1)
> > ds_temp4$closed_sls_num1 <- ifelse(is.na(ds_temp4$closed_sls_num1), 0, 1)
> > ds_temp4
> >
> > ds_temp4$closed <- ifelse(ds_temp4$closed_mdm_num1 == 1 |
> >                           ds_temp4$closed_sls_num1 == 1, 1, 0)
> > ds_temp4
> >
> > -- cut --
> >
> > Is there a better way to do it?
> >
> > Kind regards
> >
> > Georg
> >
> >
> >> Sent: Thursday, 23 June 2016 at 23:55
> >> From: "Bert Gunter" <bgunter.4567 at gmail.com>
> >> To: "David L Carlson" <dcarlson at tamu.edu>
> >> Cc: "R Help" <r-help at r-project.org>
> >> Subject: Re: [R] Subscripting problem with is.na()
> >>
> >> ... actually, FWIW, I would say that this little discussion mostly
> >> demonstrates why the OP's request is probably not a good idea in the
> >> first place. Usually, NA's should be left as NA's to be dealt with
> >> properly by R and packages. In biological measurements, for example,
> >> NA's often mean "below the ability to reliably measure." Biologists
> >> with whom I've worked over many years often want to convert these to 0
> >> or omit the cases, both of which lead to biased estimates and/or
> >> underestimates of variability and excess claims of "statistical
> >> significance" (for those who belong to this religious persuasion). One
> >> should never say never, but I suspect that there are relatively few
> >> circumstances where the conversion the OP requested is actually wise.
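
A small numeric sketch of how much the choice matters, using made-up measurements in which NA stands for "below the detection limit". It only shows that the two common shortcuts give different answers; it does not say which, if either, is appropriate for a given data set:

```r
# Hypothetical measurements; NA stands for "below the detection limit".
x <- c(5, 6, NA, 7)

mean_omit <- mean(x, na.rm = TRUE)     # drops the NA: (5 + 6 + 7) / 3
x_zero    <- replace(x, is.na(x), 0)   # recodes the NA to 0
mean_zero <- mean(x_zero)              # (5 + 6 + 0 + 7) / 4

print(c(omit = mean_omit, zero = mean_zero))
```

Omitting the case treats the unmeasured value as if it resembled the measured ones, while recoding to 0 treats it as exactly zero; the true value is only known to lie somewhere below the limit.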
> >>
> >> Feel free to ignore/reject such extraneous comments of course.
> >>
> >> Cheers,
> >> Bert
> >>
> >>
> >> Bert Gunter
> >>
> >>
> >> On Thu, Jun 23, 2016 at 12:14 PM, David L Carlson <dcarlson at tamu.edu> wrote:
> >> > Good point. I did not think about factors. Your example also raises
> >> > another issue: column c is logical but gets silently converted to
> >> > numeric. This seems to get the job done, assuming the conversion is
> >> > intended for numeric columns only:
> >> >
> >> >> test <- data.frame(a=c(1,NA,2), b = c("A","b",NA), c= rep(NA,3))
> >> >> sapply(test, class)
> >> >         a         b         c
> >> > "numeric"  "factor" "logical"
> >> >> num <- sapply(test, is.numeric)
> >> >> test[, num][is.na(test[, num])] <- 0
> >> >> test
> >> >   a    b  c
> >> > 1 1    A NA
> >> > 2 0    b NA
> >> > 3 2 <NA> NA
> >> >
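
A hedged variant of the same idea, in case it helps anyone reading the archive later. Two assumptions are mine: stringsAsFactors = TRUE is passed explicitly, because since R 4.0.0 data.frame() no longer converts strings to factors by default, and list-style indexing test[num] is used so that the selection stays a data frame however many numeric columns there are:

```r
# stringsAsFactors = TRUE reproduces the pre-4.0.0 default seen in this thread.
test <- data.frame(a = c(1, NA, 2), b = c("A", "b", NA), c = rep(NA, 3),
                   stringsAsFactors = TRUE)

# One TRUE/FALSE per column; here only column a is numeric.
num <- vapply(test, is.numeric, logical(1))

# test[num] is a data frame even when a single column matches,
# so lapply() visits each numeric column in turn.
test[num] <- lapply(test[num], function(col) replace(col, is.na(col), 0))

print(test)
print(sapply(test, class))
```

The factor column b and the logical column c keep their NAs, and their classes are unchanged.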
> >> > David C
> >> >
> >> > -----Original Message-----
> >> > From: Bert Gunter [mailto:bgunter.4567 at gmail.com]
> >> > Sent: Thursday, June 23, 2016 1:48 PM
> >> > To: David L Carlson
> >> > Cc: Ivan Calandra; R Help
> >> > Subject: Re: [R] Subscripting problem with is.na()
> >> >
> >> > Not in general, David:
> >> >
> >> > e.g.
> >> >
> >> >> test <- data.frame(a=c(1,NA,2), b = c("A","b",NA), c= rep(NA,3))
> >> >
> >> >> is.na(test)
> >> >          a     b    c
> >> > [1,] FALSE FALSE TRUE
> >> > [2,]  TRUE FALSE TRUE
> >> > [3,] FALSE  TRUE TRUE
> >> >
> >> >> test[is.na(test)]
> >> > [1] NA NA NA NA NA
> >> >
> >> >> test[is.na(test)] <- 0
> >> > Warning message:
> >> > In `[<-.factor`(`*tmp*`, thisvar, value = 0) :
> >> > invalid factor level, NA generated
> >> >
> >> >> test
> >> >   a    b c
> >> > 1 1    A 0
> >> > 2 0    b 0
> >> > 3 2 <NA> 0
> >> >
> >> >
> >> > The problem is the default conversion to factors and the replacement
> >> > operation for factors. So:
> >> >
> >> >> test <- data.frame(a=c(1,NA,2), b = I(c("A","b",NA_character_)), c= rep(NA,3))
> >> >> class(test$b)
> >> > [1] "AsIs" ## so NOT a factor
> >> >
> >> >> test[is.na(test)] <- 0 # now works as you describe
> >> >> test
> >> > a b c
> >> > 1 1 A 0
> >> > 2 0 b 0
> >> > 3 2 0 0
> >> >
> >> > Of course the OP (and you) probably had a data frame of all numerics
> >> > in mind, so the problem doesn't arise. But I think one needs to make
> >> > the distinction and issue clear.
> >> >
> >> > Cheers,
> >> > Bert
> >> >
> >> >
> >> >
> >> >
> >> >
> >> > Bert Gunter
> >> >
> >> >
> >> > On Thu, Jun 23, 2016 at 8:46 AM, David L Carlson <dcarlson at tamu.edu> wrote:
> >> >> The function is.na() returns a matrix when applied to a data.frame,
> >> >> so you can easily convert all the NAs to 0s:
> >> >>
> >> >>> ds_test
> >> >>    var1 var2
> >> >> 1     1    1
> >> >> 2     2    2
> >> >> 3     3    3
> >> >> 4    NA   NA
> >> >> 5     5    5
> >> >> 6     6    6
> >> >> 7     7    7
> >> >> 8    NA   NA
> >> >> 9     9    9
> >> >> 10   10   10
> >> >>> is.na(ds_test)
> >> >>        var1  var2
> >> >>  [1,] FALSE FALSE
> >> >>  [2,] FALSE FALSE
> >> >>  [3,] FALSE FALSE
> >> >>  [4,]  TRUE  TRUE
> >> >>  [5,] FALSE FALSE
> >> >>  [6,] FALSE FALSE
> >> >>  [7,] FALSE FALSE
> >> >>  [8,]  TRUE  TRUE
> >> >>  [9,] FALSE FALSE
> >> >> [10,] FALSE FALSE
> >> >>> ds_test[is.na(ds_test)] <- 0
> >> >>> ds_test
> >> >>    var1 var2
> >> >> 1     1    1
> >> >> 2     2    2
> >> >> 3     3    3
> >> >> 4     0    0
> >> >> 5     5    5
> >> >> 6     6    6
> >> >> 7     7    7
> >> >> 8     0    0
> >> >> 9     9    9
> >> >> 10   10   10
> >> >>
> >> >> -------------------------------------
> >> >> David L Carlson
> >> >> Department of Anthropology
> >> >> Texas A&M University
> >> >> College Station, TX 77840-4352
> >> >>
> >> >> -----Original Message-----
> >> >> From: R-help [mailto:r-help-bounces at r-project.org] On Behalf Of Ivan Calandra
> >> >> Sent: Thursday, June 23, 2016 10:14 AM
> >> >> To: R Help
> >> >> Subject: Re: [R] Subscripting problem with is.na()
> >> >>
> >> >> Thank you, Bert, for this clarification. It is indeed an important point.
> >> >>
> >> >> Ivan
> >> >>
> >> >> --
> >> >> Ivan Calandra, PhD
> >> >> Scientific Mediator
> >> >> University of Reims Champagne-Ardenne
> >> >> GEGENAA - EA 3795
> >> >> CREA - 2 esplanade Roland Garros
> >> >> 51100 Reims, France
> >> >> +33(0)3 26 77 36 89
> >> >> ivan.calandra at univ-reims.fr
> >> >> --
> >> >> https://www.researchgate.net/profile/Ivan_Calandra
> >> >> https://publons.com/author/705639/
> >> >>
> >> >> On 23/06/2016 at 17:06, Bert Gunter wrote:
> >> >>> Sorry, Ivan, your statement is incorrect:
> >> >>>
> >> >>> "When you use a single bracket on a list with only one argument in
> >> >>> between, then R extracts "elements", i.e. columns in the case of a
> >> >>> data.frame. This explains your errors. "
> >> >>>
> >> >>> e.g.
> >> >>>
> >> >>>> ex <- data.frame(a = 1:3, b = letters[1:3])
> >> >>>> a <- 1:3
> >> >>>> identical(ex[1], a)
> >> >>> [1] FALSE
> >> >>>
> >> >>>> class(ex[1])
> >> >>> [1] "data.frame"
> >> >>>> class(a)
> >> >>> [1] "integer"
> >> >>>
> >> >>> Compare:
> >> >>>
> >> >>>> identical(ex[[1]], a)
> >> >>> [1] TRUE
> >> >>>
> >> >>> Why? Single bracket extraction on a list results in a list; double
> >> >>> bracket extraction results in the element of the list (a "column" in
> >> >>> the case of a data frame, which is a specific kind of list). The
> >> >>> relevant sections of ?Extract are:
> >> >>>
> >> >>> "Indexing by [ is similar to atomic vectors and selects a **list** of
> >> >>> the specified element(s).
> >> >>>
> >> >>> Both [[ and $ select a **single element of the list**. "
> >> >>>
> >> >>>
> >> >>> Hope this clarifies this often-confused issue.
> >> >>>
> >> >>>
> >> >>> Cheers,
> >> >>> Bert
> >> >>> Bert Gunter
> >> >>>
> >> >>>
> >> >>> On Thu, Jun 23, 2016 at 7:34 AM, Ivan Calandra
> >> >>> <ivan.calandra at univ-reims.fr> wrote:
> >> >>>> My statement "Using a single bracket '[' on a data.frame does the same as
> >> >>>> for matrices: you need to specify rows and columns" was not correct.
> >> >>>>
> >> >>>>
> >> >>>> When you use a single bracket on a list with only one argument in between,
> >> >>>> then R extracts "elements", i.e. columns in the case of a data.frame. This
> >> >>>> explains your errors.
> >> >>>>
> >> >>>> But it is possible to use a single bracket on a data.frame with 2 arguments
> >> >>>> (rows, columns) separated by a comma, as with matrices. This is the solution
> >> >>>> you received.
> >> >>>>
> >> >>>> Ivan
> >> >>>>
> >> >>>>
> >> >>>>
> >> >>>>> On 23/06/2016 at 16:27, Ivan Calandra wrote:
> >> >>>>> Dear Georg,
> >> >>>>>
> >> >>>>> You need to learn a bit more about the subsetting methods, which depend on
> >> >>>>> the structure of the object you're trying to subset.
> >> >>>>>
> >> >>>>> More specifically, when you run this: ds_test[is.na(ds_test$var1)]
> >> >>>>> you get this error: "Error in `[.data.frame`(ds_test, is.na(ds_test$var1))
> >> >>>>> : undefined columns selected"
> >> >>>>>
> >> >>>>> This means that R does not understand which column you're trying to
> >> >>>>> select. But you're actually trying to select rows.
> >> >>>>>
> >> >>>>> Using a single bracket '[' on a data.frame does the same as for matrices:
> >> >>>>> you need to specify rows and columns, like this:
> >> >>>>> ds_test[is.na(ds_test$var1), ] ## notice the last comma
> >> >>>>> ds_test[is.na(ds_test$var1), ] <- 0 ## works on all columns because you
> >> >>>>> ## didn't specify any after the comma
> >> >>>>>
> >> >>>>> If you want it only for "var1", then you need to specify the column:
> >> >>>>> ds_test[is.na(ds_test$var1), "var1"] <- 0
> >> >>>>>
> >> >>>>> It's the same problem with your 2nd and 4th tries (the 4th one has other
> >> >>>>> problems). Your 3rd try does not change ds_test at all.
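
The two working forms can be put side by side in a short sketch, rebuilding ds_test as in the original post. The copies ds_a and ds_b are names I made up so that each form starts from the same untouched data:

```r
var1 <- c(1:3, NA, 5:7, NA, 9:10)
var2 <- c(1:3, NA, 5:7, NA, 9:10)
ds_test <- data.frame(var1, var2)

# Matrix-style: rows where var1 is NA, column "var1" only.
ds_a <- ds_test
ds_a[is.na(ds_a$var1), "var1"] <- 0

# Vector-style: pull the column out with $, then index it like any vector.
ds_b <- ds_test
ds_b$var1[is.na(ds_b$var1)] <- 0

print(ds_a)
print(identical(ds_a$var1, ds_b$var1))
```

Either way only var1 is recoded; var2 keeps its two NAs because it was never named in the assignment.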
> >> >>>>>
> >> >>>>> HTH,
> >> >>>>> Ivan
> >> >>>>>
> >> >>>>>
> >> >>>>>> On 23/06/2016 at 15:57, G.Maubach at weinwolf.de wrote:
> >> >>>>>> Hi All,
> >> >>>>>>
> >> >>>>>> I would like to recode my NAs to 0. Using a single vector, everything is
> >> >>>>>> fine.
> >> >>>>>>
> >> >>>>>> But if I use a data.frame things go wrong:
> >> >>>>>>
> >> >>>>>> -- cut --
> >> >>>>>>
> >> >>>>>> var1 <- c(1:3, NA, 5:7, NA, 9:10)
> >> >>>>>> var2 <- c(1:3, NA, 5:7, NA, 9:10)
> >> >>>>>> ds_test <-
> >> >>>>>> data.frame(var1, var2)
> >> >>>>>>
> >> >>>>>> test <- var1
> >> >>>>>> test[is.na(test)] <- 0
> >> >>>>>> test # NA recoded OK
> >> >>>>>>
> >> >>>>>> # First try
> >> >>>>>> ds_test[is.na(ds_test$var1)] <- 0 # duplicate subscripts WRONG
> >> >>>>>>
> >> >>>>>> # Second try
> >> >>>>>> ds_test[is.na("var1")] <- 0
> >> >>>>>> ds_test$var1 # not recoded WRONG
> >> >>>>>>
> >> >>>>>> # Third try: to me the most intuitive approach
> >> >>>>>> is.na(ds_test["var1"]) <- 0 # attempt to select less than one element in
> >> >>>>>> # integerOneIndex WRONG
> >> >>>>>>
> >> >>>>>> # Fourth try
> >> >>>>>> ds_test[is.na(var1)] <- 0 # duplicate subscripts for columns WRONG
> >> >>>>>>
> >> >>>>>> -- cut --
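
A short sketch of what the individual pieces in these tries evaluate to, which may make the failures less mysterious. The vector x and the variable msg are illustrative names of mine, and the error text is captured rather than assumed:

```r
var1 <- c(1:3, NA, 5:7, NA, 9:10)
var2 <- c(1:3, NA, 5:7, NA, 9:10)
ds_test <- data.frame(var1, var2)

# Second try: is.na("var1") tests the string itself, not the column,
# so the index selects nothing.
print(is.na("var1"))

# Third try: the replacement form `is.na<-` marks the indexed elements
# as NA; it is a way to create NAs, not to recode them.
x <- c(10, 20, 30)
is.na(x) <- 2
print(x)

# First try: a logical vector of length 10 is taken as a column index
# on a two-column data frame; the message is captured if extraction fails.
msg <- tryCatch({ ds_test[is.na(ds_test$var1)]; NA_character_ },
                error = function(e) conditionMessage(e))
print(msg)
```

In each case the index is computed correctly; it is only being applied to the wrong thing, the whole data frame rather than one column.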
> >> >>>>>> How can I do it correctly?
> >> >>>>>>
> >> >>>>>> Where could I have found something about it?
> >> >>>>>>
> >> >>>>>> Kind regards
> >> >>>>>>
> >> >>>>>> Georg
> >> >>>>>>
> >> >>>>>> ______________________________________________
> >> >>>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> >> >>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
> >> >>>>>> PLEASE do read the posting guide
> >> >>>>>> http://www.R-project.org/posting-guide.html
> >> >>>>>> and provide commented, minimal, self-contained, reproducible code.
> >> >>>>>>
>