[R] Odd Behavior Out of setdiff(...) - addition of duplicate entries is not identified

David Winsemius dwinsemius at comcast.net
Sat May 30 00:35:20 CEST 2009


But I get:

# Omitted the initial line, which would have created an object only to be
# overwritten.

 > Test1_DF<-data.frame(HouseSize=c(1:100), LandLocation=c("Here"),  
Price = c("Low"))
 > Test2_DF<-rbind(Test1_DF, Test1_DF)
 > setdiff(Test1_DF, Test2_DF)
     HouseSize LandLocation Price
1           1         Here   Low
2           2         Here   Low
3           3         Here   Low
4           4         Here   Low
5           5         Here   Low
.... snipped additional 95 rows.

Furthermore, I did not load any library (nor did you indicate which  
packages you have loaded), and there does not seem to be a  
setdiff.data.frame in my workspace:
 > setdiff.data.frame
Error: object "setdiff.data.frame" not found
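
One way to check which setdiff() a session is actually calling is to ask R
where the name resolves. A minimal sketch using only base and utils
functions; the helper name which_setdiff is made up for illustration and is
not part of any package:

```r
# Hypothetical helper: report where 'setdiff' is visible on the search path
# and whether a function named setdiff.data.frame can be found at all.
which_setdiff <- function() {
  list(
    # Search-path entries that define 'setdiff', in search order;
    # a call from the workspace uses the first entry listed.
    found_in   = find("setdiff"),
    # TRUE only if some attached package or the workspace defines it.
    has_df_fun = exists("setdiff.data.frame", mode = "function")
  )
}

res <- which_setdiff()
print(res)
```

If found_in lists anything ahead of "package:base", that package's version
masks the base one.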

 > sessionInfo()
R version 2.8.1 Patched (2009-01-19 r47650)
i386-apple-darwin9.6.0

locale:
en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    splines   stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] MASS_7.2-46       reshape_0.8.2     plyr_0.1.5        modeltools_0.2-16 mvtnorm_0.9-4
[6] survival_2.35-4

loaded via a namespace (and not attached):
[1] coin_1.0-1
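
For the row-level diff asked about in the quoted message below, here is a
rough sketch in base R that treats each data frame as a multiset of rows, so
that a second copy of a row counts as a difference. The function name
row_multiset_diff is hypothetical, and the approach (pasting each row into a
key and numbering repeated keys) is just one possible idiom, not something
taken from the prob package:

```r
# Hypothetical helper: rows of x that are not matched one-for-one by rows of y.
# Duplicates count: the 2nd copy of a row in x needs a 2nd copy in y.
row_multiset_diff <- function(x, y) {
  stopifnot(identical(names(x), names(y)))
  key <- function(df) {
    if (nrow(df) == 0L) return(character(0))
    # One string per row; "\r" is assumed not to occur in the data.
    k <- do.call(paste, c(unname(as.list(df)), sep = "\r"))
    # Number the copies of each distinct row: 1st, 2nd, ...
    occ <- ave(seq_along(k), k, FUN = seq_along)
    paste(k, occ, sep = "\r")
  }
  x[!(key(x) %in% key(y)), , drop = FALSE]
}

Test1_DF <- data.frame(HouseSize = 1:100, LandLocation = "Here", Price = "Low")
Test2_DF <- rbind(Test1_DF, Test1_DF)

nrow(row_multiset_diff(Test1_DF, Test2_DF))  # 0: every row of Test1_DF is in Test2_DF
nrow(row_multiset_diff(Test2_DF, Test1_DF))  # 100: the second copies are unmatched
```

Because the comparison is on pasted strings, numeric columns are compared by
their printed representation, which may not be what is wanted for
floating-point data.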


On May 29, 2009, at 5:58 PM, Jason Rupert wrote:

>
> Jay,
>
>
> Thanks much for the reply.    I think you are right about the prob.  
> Unfortunately, I was not able to find the old emails I had  
> discussing the use of the more powerful setdiff that essentially  
> inherits from the base class R setdiff functionality but extends  
> that functionality by now working with data.frames instead of just a  
> simple array of values.  Love this functionality.
>
> However, for the following example,
> Test1_DF<-data.frame(HouseSize=c(1:100), LandLocation=c("Here"))
> Test1_DF<-data.frame(HouseSize=c(1:100), LandLocation=c("Here"),  
> Price = c("Low"))
> Test2_DF<-rbind(Test1_DF, Test1_DF)
> setdiff(Test1_DF, Test2_DF)
> [1] HouseSize    LandLocation Price
> <0 rows> (or 0-length row.names)
>> setdiff(Test2_DF, Test1_DF)
> [1] HouseSize    LandLocation Price
> <0 rows> (or 0-length row.names)
>
> I was hoping that, for this example, one of the setdiff calls would have  
> returned essentially Test1_DF, since it is duplicated and that is  
> what is different between the two dataframes.
>
> So, I guess I am trying to figure out a way to truly diff the  
> dataframes, i.e. determine when two data.frames are different from  
> one another and then receive the output of the results.
>
> Does this capability exist in a function within a current R package,  
> or is there a commonly used idiom that provides this  
> functionality?
>
> Thanks again for any feedback you can provide.
>
>
> Also, I tried to determine my Session Info and the packages I have  
> loaded, but I received the following:
>> sessionInfo()
> Error in x$Priority : $ operator is invalid for atomic vectors
> In addition: There were 12 warnings (use warnings() to see them)
>> warnings()
> Warning messages:
> 1: In FUN(c("prob", "ggplot2", "reshape", "RColorBrewer",  ... :
>  DESCRIPTION file of package 'prob' is missing or broken
> 2: In FUN(c("prob", "ggplot2", "reshape", "RColorBrewer",  ... :
>  DESCRIPTION file of package 'ggplot2' is missing or broken
> 3: In FUN(c("prob", "ggplot2", "reshape", "RColorBrewer",  ... :
>  DESCRIPTION file of package 'reshape' is missing or broken
> 4: In FUN(c("prob", "ggplot2", "reshape", "RColorBrewer",  ... :
>  DESCRIPTION file of package 'RColorBrewer' is missing or broken
> 5: In FUN(c("prob", "ggplot2", "reshape", "RColorBrewer",  ... :
>  DESCRIPTION file of package 'proto' is missing or broken
> 6: In FUN(c("prob", "ggplot2", "reshape", "RColorBrewer",  ... :
>  DESCRIPTION file of package 'plyr' is missing or broken
> 7: In FUN(c("prob", "ggplot2", "reshape", "RColorBrewer",  ... :
>  DESCRIPTION file of package 'nortest' is missing or broken
> 8: In FUN(c("prob", "ggplot2", "reshape", "RColorBrewer",  ... :
>  DESCRIPTION file of package 'fBasics' is missing or broken
> 9: In FUN(c("prob", "ggplot2", "reshape", "RColorBrewer",  ... :
>  DESCRIPTION file of package 'timeSeries' is missing or broken
> 10: In FUN(c("prob", "ggplot2", "reshape", "RColorBrewer",  ... :
>  DESCRIPTION file of package 'timeDate' is missing or broken
> 11: In FUN(c("prob", "ggplot2", "reshape", "RColorBrewer",  ... :
>  DESCRIPTION file of package 'vcd' is missing or broken
> 12: In FUN(c("prob", "ggplot2", "reshape", "RColorBrewer",  ... :
>  DESCRIPTION file of package 'colorspace' is missing or broken
>
>
> However, I typically load the following ones:
> library(colorspace, lib.loc=RLibraryPathLocation)
> library(vcd, lib.loc=RLibraryPathLocation)
> library(timeDate, lib.loc=RLibraryPathLocation)
> library(timeSeries, lib.loc=RLibraryPathLocation)
> library(fBasics, lib.loc=RLibraryPathLocation)
> library(nortest, lib.loc=RLibraryPathLocation)
> library(plyr, lib.loc=RLibraryPathLocation)
> library(proto, lib.loc=RLibraryPathLocation)
> library(RColorBrewer, lib.loc=RLibraryPathLocation)
> library(reshape, lib.loc=RLibraryPathLocation)
> library(ggplot2, lib.loc=RLibraryPathLocation)
> library(prob, lib.loc=RLibraryPathLocation)
>
>
> --- On Fri, 5/29/09, G. Jay Kerns <gkerns at ysu.edu> wrote:
>
>> From: G. Jay Kerns <gkerns at ysu.edu>
>> Subject: Re: [R] Odd Behavior Out of setdiff(...) - addition of  
>> duplicate  entries is not identified
>> To: "Jason Rupert" <jasonkrupert at yahoo.com>
>> Cc: R-help at r-project.org
>> Date: Friday, May 29, 2009, 3:21 PM
>> Dear Jason,
>>
>> On Fri, May 29, 2009 at 2:48 PM, Jason Rupert  
>> <jasonkrupert at yahoo.com>
>> wrote:
>>>
>>> I think I am using the improved version of setdiff(...) that handles
>>> data.frames, so I think some odd behavior was expected, but this one
>>> is escaping me.
>>>
>>> It appears that the addition of duplicate entries is not caught by
>>> setdiff(...).  Is this expected behavior?
>>
>> [snip]
>>
>>> Thanks in advance for any feedback.
>>>
>>> Test1_DF<-data.frame(HouseSize=c(1:100))
>>> Test2_DF<-rbind(Test1_DF, Test1_DF)
>>> setdiff(Test1_DF, Test2_DF)
>>> integer(0)
>>> setdiff(Test2_DF, Test1_DF)
>>> integer(0)
>>>
>>> However,
>>> Test3_DF<-data.frame(HouseSize=c(1:25))
>>> setdiff(Test1_DF, Test3_DF)
>>>  [1]  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41
>>> [17]  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57
>>> [33]  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72  73
>>> [49]  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89
>>> [65]  90  91  92  93  94  95  96  97  98  99 100
>>>
>>> setdiff(Test3_DF, Test1_DF)
>>> integer(0)
>>
>>
>> You didn't explicitly say which "improved version" of setdiff() you
>> are using, so I can only presume that you are using the
>> setdiff.data.frame in the prob package.
>>
>> The behaviour you are observing is expected and matches the
>> base:::setdiff behaviour in the case of vectors;  cf.
>>
>> x1 <- c(1:100)
>> x2 <- c(x1,x1)
>>
>> setdiff(x1, x2)  # integer(0)
>> setdiff(x2, x1)  # integer(0)
>>
>> x3 <- c(1:25)
>> setdiff(x1, x3)  # 26:100
>> setdiff(x3, x1)  # integer(0)
>>
>>
>>>
>>> If so, is there another method or approach that should be used to
>>> identify duplicate row entries between two different data frames?
>>>
>>
>> The R-help archives are chock full of every possible variant of
>> questions (and answers) about this, and you haven't said _exactly_
>> what you are looking for. In the absence of an already posted
>> solution, please specify exactly what you want and I'll wager an R
>> Ninja could dispatch it in moments.
>>
>> Regards,
>> Jay
>>
>>
>> ***************************************************
>> G. Jay Kerns, Ph.D.
>> Associate Professor
>> Department of Mathematics & Statistics
>> Youngstown State University
>> Youngstown, OH 44555-0002 USA
>> Office: 1035 Cushwa Hall
>> Phone: (330) 941-3310 Office (voice mail)
>> -3302 Department
>> -3170 FAX
>> E-mail: gkerns at ysu.edu
>> http://www.cc.ysu.edu/~gjkerns/
>>
>
>
>
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

David Winsemius, MD
Heritage Laboratories
West Hartford, CT



