[R] select rows with identical columns from a data frame
David Winsemius
dwinsemius at comcast.net
Sun Jan 20 19:37:04 CET 2013
On Jan 20, 2013, at 9:27 AM, David Winsemius wrote:
>
> On Jan 20, 2013, at 8:26 AM, Sam Steingold wrote:
>
>>> * Bert Gunter <thagre.oregba at trar.pbz> [2013-01-19 22:26:46 -0800]:
>>>
>>> But David W. and Bill Dunlap gave you solutions that also work and
>>> are
>>> much faster, no?!
>>
>> Yes, indeed, and I am now using David's solution as it is fast
>> (enough), simple and concise.
>
> I am a bit surprised by that. I do agree that it was simple and
> concise, two programming virtues that I occasionally achieve.
> However, when I tested it against either of Bill Dunlap's
> suggestions mine was 15-40 times slower. (So I saved Bill's code and
> made a mental note to study its superiority.) I could see why the
> f2 version was superior, since it progressively shrank the index
> candidates for further comparison, but his first function used no
> such logic and was still 15 times faster.
>
> My test included the creation of the smaller data.frame which his
> did not, but when I modified mine to only return the index vector,
> that was the step that consumed all the time. I wondered whether it was
> `which` that consumed the time, but it appears that the inner step,
> x==x[[1]], was the culprit.
>
> > x <- data.frame(lapply(structure(1:10,names=letters[1:10]),
> +   function(i) sample(c(NA,1,1,1,2,2,2,3), replace=TRUE, size=1e6)))
>
> > system.time({ keep <- x[[1]] == x[[2]]
> + for (i in seq_len(ncol(x))[-(1:2)]) {
> + keep <- keep & x[[i - 1]] == x[[i]]
> + }
> + z2 <- !is.na(keep) & keep})
> user system elapsed
> 0.179 0.056 0.240
>
> > system.time({z <- rowSums(x==x[[1]]) })
> user system elapsed
> 3.535 0.535 4.067
>
> > system.time({z <- x==x[[1]] })
> user system elapsed
> 3.540 0.524 4.061
>
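For readers following along, the "progressively shrink the index candidates" idea mentioned in the quoted message can be sketched roughly as below. This is only an illustration under my own assumptions, not Bill Dunlap's actual f2 (which is not reproduced in this message); the function name `rows_all_equal_shrink` and the toy data frame `d_small` are hypothetical.

```r
# Hypothetical sketch of the index-shrinking idea; NOT Bill Dunlap's f2.
# Returns the integer row indices where all columns hold the same non-NA value.
rows_all_equal_shrink <- function(df) {
  first <- df[[1]]
  idx <- which(!is.na(first))            # candidate rows still in play
  for (j in seq_len(ncol(df))[-1]) {
    if (length(idx) == 0L) break         # nothing left to compare
    same <- df[[j]][idx] == first[idx]   # compare only the surviving rows
    idx <- idx[!is.na(same) & same]      # drop mismatches and NAs
  }
  idx
}

# Toy data (made up for illustration)
d_small <- data.frame(a = c(1, 2, NA, 3),
                      b = c(1, 2, NA, 1),
                      c = c(1, 3, NA, 3))
d_small[rows_all_equal_shrink(d_small), , drop = FALSE]
```

Each pass compares only the rows that survived the earlier columns, so the work per column can only shrink as mismatches are discarded.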
A further note: I was able to recover most of the timing efficiency by
first coercing the data frame to a matrix before the "==" operation:
> system.time({z <- as.matrix(x)==x[[1]] })
user system elapsed
0.181 0.140 0.320
So it's really the data.frame method for `==` (dispatched via
`Ops.data.frame`) that is the resource hog.
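
Putting that together, a minimal sketch of a row selector built on the matrix coercion might look like the following. The function name `rows_all_equal` and the toy data frame `d_toy` are hypothetical, and the sketch assumes all columns share one type, since `as.matrix` would otherwise coerce everything to character.

```r
# Hypothetical helper (name is illustrative, not from the thread).
# TRUE where every column in a row holds the same non-NA value.
# Assumes all columns are of the same type.
rows_all_equal <- function(df) {
  m <- as.matrix(df)          # one-time coercion, avoids "==" on a data.frame
  eq <- m == m[, 1]           # first column is recycled down each column
  # NA comparisons are dropped from the count, so rows containing NA
  # can never reach ncol(m) and come out FALSE
  unname(rowSums(eq, na.rm = TRUE) == ncol(m))
}

# Toy data (made up for illustration)
d_toy <- data.frame(a = c(1, 2, NA, 3),
                    b = c(1, 2, NA, 1),
                    c = c(1, 3, NA, 3))
keep <- rows_all_equal(d_toy)
d_toy[keep, , drop = FALSE]   # only the first row survives
```

The comparison relies on R's column-major recycling: a vector of length nrow(m) lines up with each column in turn, so every cell is compared with the first cell of its own row.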
--
David.
> --
> David
>
>
>
>>
>> Thanks a lot to David, Bill, Rui, and arun for their answers (to this
>> question, my many previous questions, and, I hope, my future
>> questions
>> in advance)!
>>
>>> On Sat, Jan 19, 2013 at 9:41 PM, Sam Steingold <sds at gnu.org> wrote:
>>>>> * Rui Barradas <ehvconeenqnf at fncb.cg> [2013-01-18 21:02:20 +0000]:
>>>>>
>>>>> Try the following.
>>>>>
>>>>> complete.cases(f) & apply(f, 1, function(x) all(x == x[1]))
>>>>
>>>> thanks, this works, but is horribly slow (dim(f) is 766,950x2)
>>
> --
>
> David Winsemius, MD
> Alameda, CA, USA
>
David Winsemius, MD
Alameda, CA, USA