[R] Odd Behavior Out of setdiff(...) - addition of duplicate entries is not identified
David Winsemius
dwinsemius at comcast.net
Sat May 30 00:35:20 CEST 2009
But I get:
# Omitted the initial line, which would have created an object only to be
# overwritten.
> Test1_DF<-data.frame(HouseSize=c(1:100), LandLocation=c("Here"),
Price = c("Low"))
> Test2_DF<-rbind(Test1_DF, Test1_DF)
> setdiff(Test1_DF, Test2_DF)
  HouseSize LandLocation Price
1         1         Here   Low
2         2         Here   Low
3         3         Here   Low
4         4         Here   Low
5         5         Here   Low
.... snipped additional 95 rows.
Furthermore, I did not load any library (nor did you indicate which
packages you have loaded), and there does not seem to be a
setdiff.data.frame in my workspace:
> setdiff.data.frame
Error: object "setdiff.data.frame" not found
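
One way to see which setdiff() a session is actually calling is sketched
below. It uses only functions from base and utils; the name
setdiff.data.frame is simply the one mentioned in this thread, and what
gets reported depends entirely on which packages happen to be attached.

```r
# Where does setdiff() come from in this session? Entries are listed in
# search-path order, so a package that masks base::setdiff appears first.
where_setdiff <- find("setdiff")
print(where_setdiff)

# Is a data.frame variant visible from the search path?
has_df_method <- exists("setdiff.data.frame")
print(has_df_method)

# Also look inside loaded namespaces, not only attached packages.
print(getAnywhere("setdiff.data.frame"))
```

If the second call returns FALSE and the first lists only "package:base",
the session is using the plain base version.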
> sessionInfo()
R version 2.8.1 Patched (2009-01-19 r47650)
i386-apple-darwin9.6.0
locale:
en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats4 splines stats graphics grDevices utils
datasets methods base
other attached packages:
[1] MASS_7.2-46 reshape_0.8.2 plyr_0.1.5
modeltools_0.2-16 mvtnorm_0.9-4
[6] survival_2.35-4
loaded via a namespace (and not attached):
[1] coin_1.0-1
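
As for a diff that does notice the duplicated rows: set operations ignore
how many times an element occurs, so something that counts copies is
needed. What follows is only a minimal base-R sketch. row_key() and
bag_setdiff() are made-up helper names, not functions from any package,
and the sketch assumes both data frames have the same columns in the same
order and that pasting a row's values together identifies the row.

```r
# Hypothetical helper: one character key per row, built from all columns.
row_key <- function(df) do.call(paste, c(unname(as.list(df)), sep = "\r"))

# Hypothetical helper: multiset ("bag") difference of rows, x minus y.
# A row of x is kept when it is the k-th copy in x and y holds fewer
# than k copies of that row.
bag_setdiff <- function(x, y) {
  kx <- row_key(x)
  ky <- row_key(y)
  # 1 for the first copy of a row in x, 2 for the second copy, ...
  occ_x <- ave(seq_along(kx), kx, FUN = seq_along)
  # Number of copies of each x row found in y (0 when absent)
  tab <- table(ky)
  n_y <- as.integer(tab)[match(kx, names(tab))]
  n_y[is.na(n_y)] <- 0L
  x[occ_x > n_y, , drop = FALSE]
}

Test1_DF <- data.frame(HouseSize = 1:100, LandLocation = "Here",
                       Price = "Low")
Test2_DF <- rbind(Test1_DF, Test1_DF)

nrow(bag_setdiff(Test2_DF, Test1_DF))  # 100: the second copy of each row
nrow(bag_setdiff(Test1_DF, Test2_DF))  # 0: nothing in Test1_DF is extra
```

To flag repeats within a single data frame, duplicated(Test2_DF) returns
TRUE for rows that repeat an earlier row.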
On May 29, 2009, at 5:58 PM, Jason Rupert wrote:
>
> Jay,
>
>
> Thanks much for the reply. I think you are right about the prob.
> Unfortunately, I was not able to find the old emails I had
> discussing the use of the more powerful setdiff, which essentially
> builds on the base R setdiff functionality but extends it to work
> with data.frames instead of just a simple array of values. Love
> this functionality.
>
> However, for the following example,
> Test1_DF<-data.frame(HouseSize=c(1:100), LandLocation=c("Here"))
> Test1_DF<-data.frame(HouseSize=c(1:100), LandLocation=c("Here"),
> Price = c("Low"))
> Test2_DF<-rbind(Test1_DF, Test1_DF)
> setdiff(Test1_DF, Test2_DF)
> [1] HouseSize LandLocation Price
> <0 rows> (or 0-length row.names)
>> setdiff(Test2_DF, Test1_DF)
> [1] HouseSize LandLocation Price
> <0 rows> (or 0-length row.names)
>
> I was hoping for this example one of the setdiff's would have
> returned essentially Test1_DF, since it is duplicated and that is
> what is different between the two dataframes.
>
> So, I guess I am trying to figure out a way to truly diff the
> data.frames, i.e. determine when two data.frames differ from one
> another and then get the differences as output.
>
> Does this capability exist as a function in a current R package, or
> is there a commonly used idiom that provides it?
>
> Thanks again for any feedback you can provide.
>
>
> Also, I tried to determine my Session Info and the packages I have
> loaded, but I received the following:
>> sessionInfo()
> Error in x$Priority : $ operator is invalid for atomic vectors
> In addition: There were 12 warnings (use warnings() to see them)
>> warnings()
> Warning messages:
> 1: In FUN(c("prob", "ggplot2", "reshape", "RColorBrewer", ... :
> DESCRIPTION file of package 'prob' is missing or broken
> 2: In FUN(c("prob", "ggplot2", "reshape", "RColorBrewer", ... :
> DESCRIPTION file of package 'ggplot2' is missing or broken
> 3: In FUN(c("prob", "ggplot2", "reshape", "RColorBrewer", ... :
> DESCRIPTION file of package 'reshape' is missing or broken
> 4: In FUN(c("prob", "ggplot2", "reshape", "RColorBrewer", ... :
> DESCRIPTION file of package 'RColorBrewer' is missing or broken
> 5: In FUN(c("prob", "ggplot2", "reshape", "RColorBrewer", ... :
> DESCRIPTION file of package 'proto' is missing or broken
> 6: In FUN(c("prob", "ggplot2", "reshape", "RColorBrewer", ... :
> DESCRIPTION file of package 'plyr' is missing or broken
> 7: In FUN(c("prob", "ggplot2", "reshape", "RColorBrewer", ... :
> DESCRIPTION file of package 'nortest' is missing or broken
> 8: In FUN(c("prob", "ggplot2", "reshape", "RColorBrewer", ... :
> DESCRIPTION file of package 'fBasics' is missing or broken
> 9: In FUN(c("prob", "ggplot2", "reshape", "RColorBrewer", ... :
> DESCRIPTION file of package 'timeSeries' is missing or broken
> 10: In FUN(c("prob", "ggplot2", "reshape", "RColorBrewer", ... :
> DESCRIPTION file of package 'timeDate' is missing or broken
> 11: In FUN(c("prob", "ggplot2", "reshape", "RColorBrewer", ... :
> DESCRIPTION file of package 'vcd' is missing or broken
> 12: In FUN(c("prob", "ggplot2", "reshape", "RColorBrewer", ... :
> DESCRIPTION file of package 'colorspace' is missing or broken
>
>
> However, I typically load the following ones:
> library(colorspace, lib.loc=RLibraryPathLocation)
> library(vcd, lib.loc=RLibraryPathLocation)
> library(timeDate, lib.loc=RLibraryPathLocation)
> library(timeSeries, lib.loc=RLibraryPathLocation)
> library(fBasics, lib.loc=RLibraryPathLocation)
> library(nortest, lib.loc=RLibraryPathLocation)
> library(plyr, lib.loc=RLibraryPathLocation)
> library(proto, lib.loc=RLibraryPathLocation)
> library(RColorBrewer, lib.loc=RLibraryPathLocation)
> library(reshape, lib.loc=RLibraryPathLocation)
> library(ggplot2, lib.loc=RLibraryPathLocation)
> library(prob, lib.loc=RLibraryPathLocation)
>
>
> --- On Fri, 5/29/09, G. Jay Kerns <gkerns at ysu.edu> wrote:
>
>> From: G. Jay Kerns <gkerns at ysu.edu>
>> Subject: Re: [R] Odd Behavior Out of setdiff(...) - addition of
>> duplicate entries is not identified
>> To: "Jason Rupert" <jasonkrupert at yahoo.com>
>> Cc: R-help at r-project.org
>> Date: Friday, May 29, 2009, 3:21 PM
>> Dear Jason,
>>
>> On Fri, May 29, 2009 at 2:48 PM, Jason Rupert
>> <jasonkrupert at yahoo.com>
>> wrote:
>>>
>>> I think I am using the improved version of setdiff(...) that handles
>>> data.frames, so I think some odd behavior was expected, but this one
>>> is escaping me.
>>>
>>> It appears that the addition of duplicate entries is not caught by
>>> setdiff(...). Is this expected behavior?
>>
>> [snip]
>>
>>> Thanks in advance for any feedback.
>>>
>>> Test1_DF<-data.frame(HouseSize=c(1:100))
>>> Test2_DF<-rbind(Test1_DF, Test1_DF)
>>> setdiff(Test1_DF, Test2_DF)
>>> integer(0)
>>> setdiff(Test2_DF, Test1_DF)
>>> integer(0)
>>>
>>> However,
>>> Test3_DF<-data.frame(HouseSize=c(1:25))
>>> setdiff(Test1_DF, Test3_DF)
>>>  [1]  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41
>>> [17]  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57
>>> [33]  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72  73
>>> [49]  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89
>>> [65]  90  91  92  93  94  95  96  97  98  99 100
>>>
>>> setdiff(Test3_DF, Test1_DF)
>>> integer(0)
>>
>>
>> You didn't explicitly say which "improved version" of setdiff() you
>> are using, so I can only presume that you are using the
>> setdiff.data.frame in the prob package.
>>
>> The behaviour you are observing is expected and matches the
>> base:::setdiff behaviour in the case of vectors; cf.
>>
>> x1 <- c(1:100)
>> x2 <- c(x1,x1)
>>
>> setdiff(x1, x2) # integer(0)
>> setdiff(x2, x1) # integer(0)
>>
>> x3 <- c(1:25)
>> setdiff(x1, x3) # 26:100
>> setdiff(x3, x1) # integer(0)
>>
>>
>>>
>>> If so, is there another method or approach that should be used to
>>> identify duplicate row entries between two different data frames?
>>>
>>
>> The R-help archives are chock full of every possible variant of
>> questions (and answers) about this, and you haven't said _exactly_
>> what you are looking for. In the absence of an already posted
>> solution, please specify exactly what you want and I'll wager an R
>> Ninja could dispatch it in moments.
>>
>> Regards,
>> Jay
>>
>> ***************************************************
>> G. Jay Kerns, Ph.D.
>> Associate Professor
>> Department of Mathematics & Statistics
>> Youngstown State University
>> Youngstown, OH 44555-0002 USA
>> Office: 1035 Cushwa Hall
>> Phone: (330) 941-3310 Office (voice mail)
>> -3302 Department
>> -3170 FAX
>> E-mail: gkerns at ysu.edu
>> http://www.cc.ysu.edu/~gjkerns/
>>
>
>
>
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
David Winsemius, MD
Heritage Laboratories
West Hartford, CT