[R] Odd Behavior Out of setdiff(...) - addition of duplicate entries is not identified
Jason Rupert
jasonkrupert at yahoo.com
Fri May 29 23:58:50 CEST 2009
Jay,
Thanks much for the reply. I think you are right that it is the one from the prob package. Unfortunately, I was not able to find the old emails discussing this more powerful setdiff, which essentially extends the base R setdiff so that it works with data.frames instead of just simple vectors of values. I love this functionality.
However, for the following example,
Test1_DF<-data.frame(HouseSize=c(1:100), LandLocation=c("Here"))
Test1_DF<-data.frame(HouseSize=c(1:100), LandLocation=c("Here"), Price = c("Low"))
Test2_DF<-rbind(Test1_DF, Test1_DF)
> setdiff(Test1_DF, Test2_DF)
[1] HouseSize LandLocation Price
<0 rows> (or 0-length row.names)
> setdiff(Test2_DF, Test1_DF)
[1] HouseSize LandLocation Price
<0 rows> (or 0-length row.names)
For this example, I was hoping that one of the setdiff calls would return essentially Test1_DF, since its rows are duplicated in Test2_DF and that duplication is the difference between the two data.frames.
So, I guess I am trying to figure out a way to truly diff the data.frames, i.e. determine when two data.frames differ from one another and then get the differing rows as output.
Does this capability exist as a function in a current R package, or is there a commonly used pattern for building it?
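For reference, here is a minimal sketch of one way to get the result described above, assuming what is wanted is a multiset ("bag") difference of rows, where each row of the second data.frame cancels at most one matching row of the first. The helper name row_multiset_diff is made up for illustration; it is not part of base R or the prob package, and it assumes both data.frames have the same columns and that no value contains a carriage return.

```r
# Hypothetical helper (not in base R or prob): rows of `a` that are not
# matched one-for-one by rows of `b`, so extra duplicate copies survive.
row_multiset_diff <- function(a, b) {
  stopifnot(identical(names(a), names(b)))
  # One key string per row; "\r" is assumed not to occur in the data.
  key_a <- do.call(paste, c(unname(as.list(a)), sep = "\r"))
  key_b <- do.call(paste, c(unname(as.list(b)), sep = "\r"))
  # Number the copies of each distinct row: 1st copy, 2nd copy, ...
  occ_a <- ave(seq_along(key_a), key_a, FUN = seq_along)
  occ_b <- ave(seq_along(key_b), key_b, FUN = seq_along)
  # The k-th copy in `a` is dropped only if `b` also has a k-th copy.
  keep <- !(paste(key_a, occ_a, sep = "\r") %in%
            paste(key_b, occ_b, sep = "\r"))
  a[keep, , drop = FALSE]
}

Test1_DF <- data.frame(HouseSize = 1:100, LandLocation = "Here", Price = "Low")
Test2_DF <- rbind(Test1_DF, Test1_DF)

extra_in_2 <- row_multiset_diff(Test2_DF, Test1_DF)  # the second copies
extra_in_1 <- row_multiset_diff(Test1_DF, Test2_DF)  # nothing left over
```

If the goal is only to find rows repeated within a single data.frame, base R's duplicated() may already be enough, e.g. Test2_DF[duplicated(Test2_DF), ].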
Thanks again for any feedback you can provide.
Also, I tried to determine my Session Info and the packages I have loaded, but I received the following:
> sessionInfo()
Error in x$Priority : $ operator is invalid for atomic vectors
In addition: There were 12 warnings (use warnings() to see them)
> warnings()
Warning messages:
1: In FUN(c("prob", "ggplot2", "reshape", "RColorBrewer", ... :
DESCRIPTION file of package 'prob' is missing or broken
2: In FUN(c("prob", "ggplot2", "reshape", "RColorBrewer", ... :
DESCRIPTION file of package 'ggplot2' is missing or broken
3: In FUN(c("prob", "ggplot2", "reshape", "RColorBrewer", ... :
DESCRIPTION file of package 'reshape' is missing or broken
4: In FUN(c("prob", "ggplot2", "reshape", "RColorBrewer", ... :
DESCRIPTION file of package 'RColorBrewer' is missing or broken
5: In FUN(c("prob", "ggplot2", "reshape", "RColorBrewer", ... :
DESCRIPTION file of package 'proto' is missing or broken
6: In FUN(c("prob", "ggplot2", "reshape", "RColorBrewer", ... :
DESCRIPTION file of package 'plyr' is missing or broken
7: In FUN(c("prob", "ggplot2", "reshape", "RColorBrewer", ... :
DESCRIPTION file of package 'nortest' is missing or broken
8: In FUN(c("prob", "ggplot2", "reshape", "RColorBrewer", ... :
DESCRIPTION file of package 'fBasics' is missing or broken
9: In FUN(c("prob", "ggplot2", "reshape", "RColorBrewer", ... :
DESCRIPTION file of package 'timeSeries' is missing or broken
10: In FUN(c("prob", "ggplot2", "reshape", "RColorBrewer", ... :
DESCRIPTION file of package 'timeDate' is missing or broken
11: In FUN(c("prob", "ggplot2", "reshape", "RColorBrewer", ... :
DESCRIPTION file of package 'vcd' is missing or broken
12: In FUN(c("prob", "ggplot2", "reshape", "RColorBrewer", ... :
DESCRIPTION file of package 'colorspace' is missing or broken
However, I typically load the following ones:
library(colorspace, lib.loc=RLibraryPathLocation)
library(vcd, lib.loc=RLibraryPathLocation)
library(timeDate, lib.loc=RLibraryPathLocation)
library(timeSeries, lib.loc=RLibraryPathLocation)
library(fBasics, lib.loc=RLibraryPathLocation)
library(nortest, lib.loc=RLibraryPathLocation)
library(plyr, lib.loc=RLibraryPathLocation)
library(proto, lib.loc=RLibraryPathLocation)
library(RColorBrewer, lib.loc=RLibraryPathLocation)
library(reshape, lib.loc=RLibraryPathLocation)
library(ggplot2, lib.loc=RLibraryPathLocation)
library(prob, lib.loc=RLibraryPathLocation)
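Since the warnings complain about DESCRIPTION files, a quick check of the library directory may help narrow things down. The sketch below is only a diagnostic under assumptions: check_descriptions is a hypothetical helper name, and it merely tests whether each package directory has a DESCRIPTION file that read.dcf() can parse, which may or may not be what sessionInfo() is tripping over.

```r
# Hypothetical helper: for every package directory under `lib`, report
# whether a DESCRIPTION file exists and can be parsed by read.dcf().
check_descriptions <- function(lib) {
  pkgs <- list.dirs(lib, full.names = FALSE, recursive = FALSE)
  ok <- vapply(pkgs, function(p) {
    f <- file.path(lib, p, "DESCRIPTION")
    file.exists(f) && !inherits(try(read.dcf(f), silent = TRUE), "try-error")
  }, logical(1))
  data.frame(package = pkgs, description_ok = unname(ok),
             stringsAsFactors = FALSE)
}

# Usage, with RLibraryPathLocation being the path used in the library() calls:
# check_descriptions(RLibraryPathLocation)
```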
--- On Fri, 5/29/09, G. Jay Kerns <gkerns at ysu.edu> wrote:
> From: G. Jay Kerns <gkerns at ysu.edu>
> Subject: Re: [R] Odd Behavior Out of setdiff(...) - addition of duplicate entries is not identified
> To: "Jason Rupert" <jasonkrupert at yahoo.com>
> Cc: R-help at r-project.org
> Date: Friday, May 29, 2009, 3:21 PM
> Dear Jason,
>
> On Fri, May 29, 2009 at 2:48 PM, Jason Rupert <jasonkrupert at yahoo.com> wrote:
> >
> > I think I am using the improved version of setdiff(...) that handles data.frames, so I think some odd behavior was expected, but this one is escaping me.
> >
> > It appears that the addition of duplicate entries is not caught by setdiff(...). Is this expected behavior?
>
> [snip]
>
> > Thanks in advance for any feedback.
> >
> > Test1_DF<-data.frame(HouseSize=c(1:100))
> > Test2_DF<-rbind(Test1_DF, Test1_DF)
> > setdiff(Test1_DF, Test2_DF)
> > integer(0)
> > setdiff(Test2_DF, Test1_DF)
> > integer(0)
> >
> > However,
> > Test3_DF<-data.frame(HouseSize=c(1:25))
> > setdiff(Test1_DF, Test3_DF)
> >  [1]  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41
> > [17]  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57
> > [33]  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72  73
> > [49]  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89
> > [65]  90  91  92  93  94  95  96  97  98  99 100
> >
> > setdiff(Test3_DF, Test1_DF)
> > integer(0)
>
>
> You didn't explicitly say which "improved version" of setdiff() you are using, so I can only presume that you are using the setdiff.data.frame in the prob package.
>
> The behaviour you are observing is expected and matches the base:::setdiff behaviour in the case of vectors; cf.
>
> x1 <- c(1:100)
> x2 <- c(x1,x1)
>
> setdiff(x1, x2) # integer(0)
> setdiff(x2, x1) # integer(0)
>
> x3 <- c(1:25)
> setdiff(x1, x3) # 26:100
> setdiff(x3, x1) # integer(0)
>
>
> >
> > If so, is there another method or approach that should be used to identify duplicate row entries between two different data frames?
> >
>
> The R-help archives are chock full of every possible variant of questions (and answers) about this, and you haven't said _exactly_ what you are looking for. In the absence of an already posted solution, please specify exactly what you want and I'll wager an R Ninja could dispatch it in moments.
>
> Regards,
> Jay
>
> ***************************************************
> G. Jay Kerns, Ph.D.
> Associate Professor
> Department of Mathematics & Statistics
> Youngstown State University
> Youngstown, OH 44555-0002 USA
> Office: 1035 Cushwa Hall
> Phone: (330) 941-3310 Office (voice mail)
> -3302 Department
> -3170 FAX
> E-mail: gkerns at ysu.edu
> http://www.cc.ysu.edu/~gjkerns/
>