[R] Newbie wants to compare 2 huge RDSs row by row.

Eric Berger ericjberger at gmail.com
Sun Jan 28 17:47:45 CET 2018


Hi Henrik,
Thanks for pointing out the diffobj package and the clear example. Nice!


On Sun, Jan 28, 2018 at 6:22 PM, Marsh Hardy ARA/RISK <mhardy at ara.com>
wrote:

> Thanks, I think I've found the most succinct expression of differences in
> two data.frames...
>
> length(which( rowSums( x1 != x2 ) > 0))
>
> gives a count of the # of records in two data.frames that do not match.
>
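One caveat on the one-liner above: if either data.frame contains NA, x1 != x2 yields NA for those cells, rowSums() propagates it, and which() silently drops the row. Below is a minimal sketch of an NA-aware variant, assuming both data.frames have the same dimensions and column order. The toy data, and the rule that two NAs count as a match, are my assumptions and not part of the original suggestion.

```r
# Toy data (hypothetical): row 2 differs in X, row 3 differs in Y and has NA
# in X on both sides, row 4 has NA in Y on one side only.
x1 <- data.frame(X = c(1, 2, NA, 4), Y = c("A", "B", "C", NA),
                 stringsAsFactors = FALSE)
x2 <- data.frame(X = c(1, 5, NA, 4), Y = c("A", "B", "D", "E"),
                 stringsAsFactors = FALSE)

# Cell-wise mismatch matrix; NA wherever either side is missing.
neq <- x1 != x2

# Fill the NA cells: mismatch if exactly one side is NA, match if both are.
na_cells <- is.na(neq)
neq[na_cells] <- (is.na(x1) != is.na(x2))[na_cells]

mismatch_rows <- which(rowSums(neq) > 0)   # row numbers that differ
n_mismatch <- length(mismatch_rows)        # count of differing rows

# For comparison: the plain one-liner misses every row that contains an NA.
naive_rows <- which(rowSums(x1 != x2) > 0)
```

On this toy data the plain version reports only row 2, while the NA-aware version reports rows 2, 3 and 4.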
> //
> ________________________________________
> From: Henrik Bengtsson [henrik.bengtsson at gmail.com]
> Sent: Sunday, January 28, 2018 11:12 AM
> To: Ulrik Stervbo
> Cc: Marsh Hardy ARA/RISK; r-help at r-project.org
> Subject: Re: [R] Newbie wants to compare 2 huge RDSs row by row.
>
> The diffobj package (https://cran.r-project.org/package=diffobj) is
> really helpful here.  It provides "diff" functions diffPrint(),
> diffStr(), and diffChr() to compare two objects 'x' and 'y' and provide
> neat colorized summary output.
>
> Example:
>
> > iris2 <- iris
> > iris2[122:125,4] <- iris2[122:125,4] + 0.1
>
> > diffobj::diffPrint(iris2, iris)
> < iris2
> > iris
> @@ 121,8 / 121,8 @@
> ~     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
>   120          6.0         2.2          5.0         1.5  virginica
>   121          6.9         3.2          5.7         2.3  virginica
> < 122          5.6         2.8          4.9         2.1  virginica
> > 122          5.6         2.8          4.9         2.0  virginica
> < 123          7.7         2.8          6.7         2.1  virginica
> > 123          7.7         2.8          6.7         2.0  virginica
> < 124          6.3         2.7          4.9         1.9  virginica
> > 124          6.3         2.7          4.9         1.8  virginica
> < 125          6.7         3.3          5.7         2.2  virginica
> > 125          6.7         3.3          5.7         2.1  virginica
>   126          7.2         3.2          6.0         1.8  virginica
>   127          6.2         2.8          4.8         1.8  virginica
>
> What's not shown here is that the colored output (supported by many
> terminals these days) also highlights exactly which elements in those
> rows differ.
>
> /Henrik
>
> On Sun, Jan 28, 2018 at 12:17 AM, Ulrik Stervbo <ulrik.stervbo at gmail.com>
> wrote:
> > The anti_join from the package dplyr might also be handy.
> >
> > install.packages("dplyr")
> > library(dplyr)
> > anti_join(x1, x2)
> >
> > You can get help on the different functions by ?function.name(), so
> > ?anti_join() will bring you help - and examples - on the anti_join
> > function.
> >
> > It might be worth testing your approach on a small subset of the data.
> > That makes it easier for you to follow what happens and evaluate the
> > outcome.
> >
> > HTH
> > Ulrik
> >
> > Marsh Hardy ARA/RISK <mhardy at ara.com> wrote on Sun, Jan 28, 2018, 04:14:
> >
> >> Cool, looks like that'd do it, almost as if converting an entire record
> >> to a character string and comparing strings.
> >>
> >> ________________________________________
> >> From: William Dunlap [wdunlap at tibco.com]
> >> Sent: Saturday, January 27, 2018 4:57 PM
> >> To: Marsh Hardy ARA/RISK
> >> Cc: Ulrik Stervbo; Eric Berger; r-help at r-project.org
> >> Subject: Re: [R] Newbie wants to compare 2 huge RDSs row by row.
> >>
> >> If your two objects have class "data.frame" (look at class(objectName))
> >> and they both have the same number of columns and the same order of
> >> columns and the column types match closely enough (use
> >> all.equal(x1, x2) for that), then you can try
> >>      which( rowSums( x1 != x2 ) > 0)
> >> E.g.,
> >> > x1 <- data.frame(X=1:5, Y=rep(c("A","B"),c(3,2)))
> >> > x2 <- data.frame(X=c(1,2,-3,-4,5), Y=rep(c("A","B"),c(2,3)))
> >> > x1
> >>   X Y
> >> 1 1 A
> >> 2 2 A
> >> 3 3 A
> >> 4 4 B
> >> 5 5 B
> >> > x2
> >>    X Y
> >> 1  1 A
> >> 2  2 A
> >> 3 -3 B
> >> 4 -4 B
> >> 5  5 B
> >> > which( rowSums( x1 != x2 ) > 0)
> >> [1] 3 4
> >>
> >> If you want to allow small numeric differences but exact character
> >> matches you will have to get a bit fancier.  Splitting the data.frames
> >> into character and numeric parts and comparing each works well.
> >>
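The split into numeric and character parts described above could be sketched roughly as follows. This is only an illustration under stated assumptions: both data.frames have the same columns in the same order with matching types, there are no NA values, and tol is a hypothetical absolute tolerance that I picked for the example.

```r
# Toy data (hypothetical): row 2 differs only by floating-point noise,
# row 3 differs numerically, row 4 differs in the character column.
x1 <- data.frame(X = c(1, 2, 3, 4), Y = c("A", "B", "C", "D"),
                 stringsAsFactors = FALSE)
x2 <- data.frame(X = c(1, 2 + 1e-12, 3.5, 4), Y = c("A", "B", "C", "E"),
                 stringsAsFactors = FALSE)

tol <- 1e-8  # assumed tolerance; choose one that suits your data

# Which columns are numeric (taken from x1; x2 is assumed to agree).
is_num <- vapply(x1, is.numeric, logical(1))

# Numeric part: a cell differs if the absolute difference exceeds tol.
num_diff <- abs(as.matrix(x1[is_num]) - as.matrix(x2[is_num])) > tol

# Non-numeric part: exact comparison.
chr_diff <- as.matrix(x1[!is_num]) != as.matrix(x2[!is_num])

# A row is flagged if either part has at least one differing cell.
bad_rows <- which(rowSums(num_diff) > 0 | rowSums(chr_diff) > 0)
```

Here row 2 is not flagged because its numeric difference is below tol, whereas an exact comparison with != would flag it.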
> >> Bill Dunlap
> >> TIBCO Software
> >> wdunlap tibco.com
> >>
> >> On Sat, Jan 27, 2018 at 1:18 PM, Marsh Hardy ARA/RISK <mhardy at ara.com>
> >> wrote:
> >> Hi Guys, I apologize for my rank & utter newness at R.
> >>
> >> I used summary() and found about 95 variables, both character and
> >> numeric, all with "Length:368842", which I assume is the # of records.
> >>
> >> I'd like to know the record number (row #?) of any record where the data
> >> doesn't match in the 2 files of what should be the same output.
> >>
> >> Thanks in advance, M.
> >>
> >> //
> >> ________________________________________
> >> From: Ulrik Stervbo [ulrik.stervbo at gmail.com]
> >> Sent: Saturday, January 27, 2018 10:00 AM
> >> To: Eric Berger
> >> Cc: Marsh Hardy ARA/RISK; r-help at r-project.org
> >> Subject: Re: [R] Newbie wants to compare 2 huge RDSs row by row.
> >>
> >> Also, it will be easier to provide helpful information if you'd describe
> >> what in your data you want to compare and what you hope to get out of
> >> the comparison.
> >>
> >> Best wishes,
> >> Ulrik
> >>
> >> Eric Berger <ericjberger at gmail.com> wrote on Sat, Jan 27, 2018, 08:18:
> >> Hi Marsh,
> >> An RDS file is not a data structure such as a data.frame. It can hold
> >> any single R object.
> >> For example, if I want to save my objects a, b, c I could do:
> >> > saveRDS( list(a,b,c), file="tmp.RDS")
> >> Then read them back later with
> >> > myList <- readRDS( "tmp.RDS" )
> >>
> >> Do you have additional information about your "RDSs" ?
> >>
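If it is unclear what an RDS file holds, a quick check after readRDS() might look like the sketch below. The temporary file and the toy data.frame are stand-ins I made up so the example runs on its own; with real data you would point readRDS() at your own file.

```r
# Write a small stand-in object to a temporary .rds file (hypothetical data).
path <- tempfile(fileext = ".rds")
saveRDS(data.frame(X = 1:3, Y = c("A", "B", "C")), file = path)

# Read it back and inspect what kind of object it is.
obj <- readRDS(path)
obj_class <- class(obj)        # e.g. "data.frame", "list", "matrix", ...
is_df <- is.data.frame(obj)    # TRUE only for data.frames
obj_dim <- dim(obj)            # rows and columns, or NULL for a plain list

unlink(path)                   # clean up the temporary file
```

Only when both files turn out to hold data.frames of the same shape does a row-by-row comparison make sense.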
> >> Eric
> >>
> >>
> >> On Sat, Jan 27, 2018 at 6:54 AM, Marsh Hardy ARA/RISK <mhardy at ara.com>
> >> wrote:
> >>
> >> > Each RDS is 40 MB. What's some slick code to compare them row by row,
> >> > IDing row numbers with mismatches?
> >> >
> >> > Thanks in advance.
> >> >
> >> > //
> >> >
> >> > ______________________________________________
> >> > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> >> > https://stat.ethz.ch/mailman/listinfo/r-help
> >> > PLEASE do read the posting guide
> >> > http://www.R-project.org/posting-guide.html
> >> > and provide commented, minimal, self-contained, reproducible code.
> >> >
> >>
> >>         [[alternative HTML version deleted]]
> >
>
