[R] Newbie wants to compare 2 huge RDSs row by row.
Eric Berger
ericjberger at gmail.com
Sun Jan 28 17:47:45 CET 2018
Hi Henrik,
Thanks for pointing out the diffobj package and the clear example. Nice!
On Sun, Jan 28, 2018 at 6:22 PM, Marsh Hardy ARA/RISK <mhardy at ara.com>
wrote:
> Thanks, I think I've found the most succinct expression of differences in
> two data.frames...
>
> length(which( rowSums( x1 != x2 ) > 0))
>
> gives a count of the # of records in two data.frames that do not match.
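A minimal sketch of that count on toy data, assuming both data.frames have the same dimensions and column order. The frames x1 and x2 below are made-up stand-ins, not the real files. One caveat worth checking: `!=` returns NA whenever either cell is NA, and which() silently drops those rows, so the sketch also shows an NA-aware variant.

```r
# Toy stand-ins for the two data.frames (hypothetical data)
x1 <- data.frame(X = c(1, 2, NA, 4), Y = c("A", "B", "C", "D"),
                 stringsAsFactors = FALSE)
x2 <- data.frame(X = c(1, 5, NA, NA), Y = c("A", "B", "C", "D"),
                 stringsAsFactors = FALSE)

# Plain version: rows whose comparison yields NA are dropped by which()
naive <- which(rowSums(x1 != x2) > 0)

# NA-aware version: NA vs. NA counts as a match, NA vs. value as a mismatch
cell_differs <- x1 != x2                      # logical matrix, may contain NA
cell_differs[is.na(cell_differs)] <- FALSE    # undecided cells handled below
na_mismatch <- is.na(x1) != is.na(x2)         # exactly one side is NA
bad_rows <- which(rowSums(cell_differs | na_mismatch) > 0)

n_bad <- length(bad_rows)                     # number of mismatching records
```

Here the plain version reports only row 2, while the NA-aware version also flags row 4, where one side is missing.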
>
> //
> ________________________________________
> From: Henrik Bengtsson [henrik.bengtsson at gmail.com]
> Sent: Sunday, January 28, 2018 11:12 AM
> To: Ulrik Stervbo
> Cc: Marsh Hardy ARA/RISK; r-help at r-project.org
> Subject: Re: [R] Newbie wants to compare 2 huge RDSs row by row.
>
> The diffobj package (https://cran.r-project.org/package=diffobj) is
> really helpful here. It provides "diff" functions diffPrint(),
> diffStr(), and diffChr() to compare two objects 'x' and 'y' and provide
> neat colorized summary output.
>
> Example:
>
> > iris2 <- iris
> > iris2[122:125,4] <- iris2[122:125,4] + 0.1
>
> > diffobj::diffPrint(iris2, iris)
> < iris2
> > iris
> @@ 121,8 / 121,8 @@
> ~     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
>   120          6.0         2.2          5.0         1.5 virginica
>   121          6.9         3.2          5.7         2.3 virginica
> < 122          5.6         2.8          4.9         2.1 virginica
> > 122          5.6         2.8          4.9         2.0 virginica
> < 123          7.7         2.8          6.7         2.1 virginica
> > 123          7.7         2.8          6.7         2.0 virginica
> < 124          6.3         2.7          4.9         1.9 virginica
> > 124          6.3         2.7          4.9         1.8 virginica
> < 125          6.7         3.3          5.7         2.2 virginica
> > 125          6.7         3.3          5.7         2.1 virginica
>   126          7.2         3.2          6.0         1.8 virginica
>   127          6.2         2.8          4.8         1.8 virginica
>
> What's not shown here is that the colored output (supported by many
> terminals these days) also highlights exactly which elements in those
> rows differ.
>
> /Henrik
>
> On Sun, Jan 28, 2018 at 12:17 AM, Ulrik Stervbo <ulrik.stervbo at gmail.com>
> wrote:
> > The anti_join from the package dplyr might also be handy.
> >
> > install.packages("dplyr")
> > library(dplyr)
> > anti_join(x1, x2)
> >
> > You can get help on a function with ?function.name, so
> > ?anti_join will bring you help - and examples - on the anti_join
> > function.
> >
> > It might be worth testing your approach on a small subset of the data.
> > That makes it easier for you to follow what happens and evaluate the
> > outcome.
> >
> > HTH
> > Ulrik
> >
> > Marsh Hardy ARA/RISK <mhardy at ara.com> wrote on Sun., 28 Jan. 2018, 04:14:
> >
> >> Cool, looks like that'd do it, almost as if converting an entire record
> >> to a character string and comparing strings.
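The "record as a character string" idea can also be tried literally. The sketch below is only an illustration, reusing the toy x1 and x2 from the example quoted further down; row_key() is a hypothetical helper, and the "\r" separator is an arbitrary choice assumed not to occur in the data. Numbers are compared through their character representation here, so tiny floating-point differences may or may not show up.

```r
# Toy data, same values as in the quoted example
x1 <- data.frame(X = 1:5, Y = rep(c("A", "B"), c(3, 2)),
                 stringsAsFactors = FALSE)
x2 <- data.frame(X = c(1, 2, -3, -4, 5), Y = rep(c("A", "B"), c(2, 3)),
                 stringsAsFactors = FALSE)

# Hypothetical helper: paste all columns of each row into one string.
# unname() keeps column names from being mistaken for paste() arguments.
row_key <- function(df) do.call(paste, c(unname(as.list(df)), sep = "\r"))

# Row numbers whose string keys differ
mismatched <- which(row_key(x1) != row_key(x2))
```

On this toy data the result is rows 3 and 4, the same rows the rowSums() approach reports.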
> >>
> >> ________________________________________
> >> From: William Dunlap [wdunlap at tibco.com]
> >> Sent: Saturday, January 27, 2018 4:57 PM
> >> To: Marsh Hardy ARA/RISK
> >> Cc: Ulrik Stervbo; Eric Berger; r-help at r-project.org
> >> Subject: Re: [R] Newbie wants to compare 2 huge RDSs row by row.
> >>
> >> If your two objects have class "data.frame" (look at class(objectName))
> >> and they both have the same number of columns and the same order of
> >> columns and the column types match closely enough (use all.equal(x1, x2)
> >> for that), then you can try
> >>    which( rowSums( x1 != x2 ) > 0)
> >> E.g.,
> >> > x1 <- data.frame(X=1:5, Y=rep(c("A","B"),c(3,2)))
> >> > x2 <- data.frame(X=c(1,2,-3,-4,5), Y=rep(c("A","B"),c(2,3)))
> >> > x1
> >> X Y
> >> 1 1 A
> >> 2 2 A
> >> 3 3 A
> >> 4 4 B
> >> 5 5 B
> >> > x2
> >> X Y
> >> 1 1 A
> >> 2 2 A
> >> 3 -3 B
> >> 4 -4 B
> >> 5 5 B
> >> > which( rowSums( x1 != x2 ) > 0)
> >> [1] 3 4
> >>
> >> If you want to allow small numeric differences but exact character
> >> matches you will have to get a bit fancier. Splitting the data.frames
> >> into character and numeric parts and comparing each works well.
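One possible reading of that split-and-compare suggestion, as a rough sketch only: it assumes both data.frames have the same columns in the same order and contain no NAs. The tolerance `tol` and the toy data are hypothetical choices, not values from the thread.

```r
# Toy data (hypothetical): row 1 differs only by floating-point noise
x1 <- data.frame(X = c(1, 2, 3), Y = c("A", "B", "C"),
                 stringsAsFactors = FALSE)
x2 <- data.frame(X = c(1 + 1e-12, 2.5, 3), Y = c("A", "B", "Z"),
                 stringsAsFactors = FALSE)

tol <- 1e-8   # assumed absolute tolerance; pick one that suits the data

# Which columns are numeric (taken from x1, assumed to match x2)
is_num <- vapply(x1, is.numeric, logical(1))

# Numeric part: a cell differs only if the absolute gap exceeds tol
num_diff <- abs(as.matrix(x1[is_num]) - as.matrix(x2[is_num])) > tol

# Non-numeric part: exact comparison
chr_diff <- as.matrix(x1[!is_num]) != as.matrix(x2[!is_num])

# A row is flagged if any of its cells differs in either part
bad_rows <- which(rowSums(num_diff) + rowSums(chr_diff) > 0)
```

With these values row 1 passes despite the 1e-12 gap, while rows 2 and 3 are flagged for a numeric and a character mismatch respectively.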
> >>
> >> Bill Dunlap
> >> TIBCO Software
> >> wdunlap tibco.com
> >>
> >> On Sat, Jan 27, 2018 at 1:18 PM, Marsh Hardy ARA/RISK <mhardy at ara.com>
> >> wrote:
> >> Hi Guys, I apologize for my rank & utter newness at R.
> >>
> >> I used summary() and found about 95 variables, both character and
> >> numeric, all with "Length:368842", which I assume is the # of records.
> >>
> >> I'd like to know the record number (row #?) of any record where the data
> >> doesn't match in the 2 files of what should be the same output.
> >>
> >> Thanks in advance, M.
> >>
> >> //
> >> ________________________________________
> >> From: Ulrik Stervbo [ulrik.stervbo at gmail.com]
> >> Sent: Saturday, January 27, 2018 10:00 AM
> >> To: Eric Berger
> >> Cc: Marsh Hardy ARA/RISK; r-help at r-project.org
> >> Subject: Re: [R] Newbie wants to compare 2 huge RDSs row by row.
> >>
> >> Also, it will be easier to provide helpful information if you'd describe
> >> what in your data you want to compare and what you hope to get out of
> >> the comparison.
> >>
> >> Best wishes,
> >> Ulrik
> >>
> >> Eric Berger <ericjberger at gmail.com> wrote on Sat., 27 Jan. 2018, 08:18:
> >> Hi Marsh,
> >> An RDS is not a data structure such as a data.frame. It can be anything.
> >> For example if I want to save my objects a, b, c I could do:
> >> > saveRDS( list(a,b,c), file="tmp.RDS")
> >> Then read them back later with
> >> > myList <- readRDS( "tmp.RDS" )
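A short sketch of that round trip, with a check of what actually comes back from readRDS(). The object contents and the temporary file path are made up for illustration. The point is only that class() and str() reveal whether the file holds a data.frame or something else.

```r
# Save an arbitrary object (here a list) to a temporary RDS file
path <- tempfile(fileext = ".RDS")
saveRDS(list(a = 1:3, b = "text", d = data.frame(X = 1:2)), file = path)

# Read it back and inspect what kind of object it is
obj <- readRDS(path)
obj_class <- class(obj)
is_df <- is.data.frame(obj)

if (is_df) {
  print(dim(obj))              # rows and columns of a data.frame
} else {
  str(obj, max.level = 1)      # top-level structure of anything else
}

unlink(path)                   # clean up the temporary file
```

If is.data.frame() returns TRUE for both files, the row-by-row comparisons discussed in this thread apply directly.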
> >>
> >> Do you have additional information about your "RDSs" ?
> >>
> >> Eric
> >>
> >>
> >> On Sat, Jan 27, 2018 at 6:54 AM, Marsh Hardy ARA/RISK <mhardy at ara.com>
> >> wrote:
> >>
> >> > Each RDS is 40 MBs. What's a slick code to compare them row by row,
> >> > IDing row numbers with mismatches?
> >> >
> >> > Thanks in advance.
> >> >
> >> > //
> >> >
> >> > ______________________________________________
> >> > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> >> > https://stat.ethz.ch/mailman/listinfo/r-help
> >> > PLEASE do read the posting guide
> >> > http://www.R-project.org/posting-guide.html
> >> > and provide commented, minimal, self-contained, reproducible code.
> >> >
> >>