[R] Duplicates and duplicated
William Dunlap
wdunlap at tibco.com
Fri May 15 00:17:17 CEST 2009
> -----Original Message-----
> From: Bert Gunter [mailto:gunter.berton at gene.com]
> Sent: Thursday, May 14, 2009 2:31 PM
> To: William Dunlap; 'Gabor Grothendieck'; 'christiaan pauw';
> 'jim holtman'
> Cc: r-help at r-project.org
> Subject: RE: [R] Duplicates and duplicated
>
>
> Thanks, Bill. I also had some concerns about how reliable numeric values
> converted to character might be, so I'm glad to have an authoritative
> criticism. Of course, I was really just being cute with R's versatility.
>
> But Jim Holtman's solution seems like the best way to go, anyway, does it
> not?
That was
f3 <- function(x) duplicated(x) | duplicated(x, fromLast=TRUE)
which is equivalent to
function(x) duplicated(x) | rev(duplicated(rev(x)))
in S+, which doesn't have the fromLast= argument.
It avoids the problems involved in table() and ave(),
but it just seems sneaky to me.
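For readers without S+ at hand, the equivalence of the two forms can be checked directly in R. This is only a minimal sketch: the name f3_rev is made up here for the rev()-based variant, and the test vector is just an illustration.

```r
# f3 as posted in the thread; f3_rev is a hypothetical name for the
# rev()-based form that needs no fromLast= argument.
f3     <- function(x) duplicated(x) | duplicated(x, fromLast = TRUE)
f3_rev <- function(x) duplicated(x) | rev(duplicated(rev(x)))

x <- c(1, 2, 8, 2, 4, 5, 10, 1, 4, 16, 2)

# duplicated(x) flags every occurrence after the first; the reversed
# call flags every occurrence before the last. OR-ing the two flags
# every member of any group of size >= 2.
stopifnot(identical(f3(x), f3_rev(x)))
which(f3(x))
```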
Linlin Yan's
f4 <- function(x) x %in% x[duplicated(x)]
seems to me more direct and also avoids those problems.
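A sketch of how f4 behaves on data with missing values, using the x2 vector quoted later in this message plus a made-up vector x2b in which NA itself is repeated. It relies on match(), and hence %in%, treating NA as matching NA and never returning NA:

```r
f4 <- function(x) x %in% x[duplicated(x)]

x2  <- c(1, 2, 3, NA, 10, 6, 3)    # from the thread: one NA, 3 repeated
x2b <- c(1, NA, 2, NA)             # hypothetical: NA itself is repeated

r2  <- f4(x2)    # the lone NA is not flagged, both 3s are
r2b <- f4(x2b)   # both NAs are flagged

# The result is a plain logical vector with no NA entries.
stopifnot(!anyNA(r2), !anyNA(r2b))
```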
Mine was wrong. It fails on
x <- c(1, 2, 8, 2, 4, 5, 10, 1, 4, 16, 2)
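To see where it goes wrong, here is the earlier version (quoted further down in this message) run on that vector. The name f2_old is introduced only to keep it apart from the corrected f2. The trouble appears to be that tix is indexed by first-occurrence position and has only max(ix) entries, so a duplicate sitting beyond the last first occurrence is never flagged:

```r
# Earlier version from the quoted message, renamed f2_old for this sketch.
f2_old <- function(x) {
  ix  <- match(x, x)           # position of the first occurrence of each value
  tix <- tabulate(ix)          # counts, indexed by first-occurrence position
  retval <- logical(length(x))
  retval[which(tix != 1)] <- TRUE
  retval
}

x <- c(1, 2, 8, 2, 4, 5, 10, 1, 4, 16, 2)
expected <- duplicated(x) | duplicated(x, fromLast = TRUE)

# Positions where f2_old disagrees with the expected answer.
bad <- which(f2_old(x) != expected)
bad
```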
My intent was to provide one that would generalize to identifying
all elements that had n or more repetitions in the input vector.
(E.g., you may want to drop from some analysis subjects with
fewer than 5 observations on them.) The corrected version is
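f2 <- function(x, n=2){
   ix <- match(x, x)          # position of the first occurrence of each value
   tix <- tabulate(ix)        # counts, indexed by first-occurrence position
   ix %in% which(tix >= n)    # TRUE where the value occurs n or more times
}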
E.g.,
> rbind(x, f2(x), f3(x), f4(x)) # identify duplicated entries
  [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
x    1    2    8    2    4    5   10    1    4    16     2
     1    1    0    1    1    0    0    1    1     0     1
     1    1    0    1    1    0    0    1    1     0     1
     1    1    0    1    1    0    0    1    1     0     1
> rbind(x, f2(x, n=3)) # find ones with >= 3 reps
  [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
x    1    2    8    2    4    5   10    1    4    16     2
     0    1    0    1    0    0    0    0    0     0     1
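The n= argument covers the use case mentioned above. A sketch with a made-up data frame (obs, subject and value are illustrative names, not from the thread), keeping only subjects with at least 5 observations:

```r
f2 <- function(x, n = 2) {
  ix  <- match(x, x)
  tix <- tabulate(ix)
  ix %in% which(tix >= n)
}

# Hypothetical data: subject "a" has 5 rows, "b" has 2, "c" has 1.
obs <- data.frame(
  subject = c("a", "b", "a", "a", "c", "a", "b", "a"),
  value   = c(3.1, 2.7, 2.9, 3.4, 5.0, 3.0, 2.5, 3.3),
  stringsAsFactors = FALSE
)

keep <- f2(obs$subject, n = 5)
kept <- obs[keep, ]     # rows for subjects with >= 5 observations
```

Because the result is in the original order of the data, it can be used directly as a row index, with no sorting or merging step.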
>
> -- Bert
>
> Bert Gunter
> Genentech Nonclinical Biostatistics
>
>
> -----Original Message-----
> From: William Dunlap [mailto:wdunlap at tibco.com]
> Sent: Thursday, May 14, 2009 10:44 AM
> To: Bert Gunter; Gabor Grothendieck; christiaan pauw
> Cc: r-help at r-project.org
> Subject: RE: [R] Duplicates and duplicated
>
> The table()-based solution can have problems when there are
> very closely spaced floating point numbers in x, as in
> x1<-c(1, 1-.Machine$double.eps,
> 1+2*.Machine$double.eps)[c(1,2,3,2,3)]
> It also relies on table(x) turning x into a factor with the default
> levels=as.character(sort(x)) and that default may change.
> It omits NA's from the result. (I think it also ought to put the results
> in the original order of the data, so one can, e.g., omit or select values
> which are duplicated.)
>
> The ave()-based solution fails when there are NA's or NaN's in the data.
> x2 <- c(1,2,3,NA,10,6,3)
>
> The ave()-based solution can be slower than necessary on long datasets,
> especially ones with few or no duplicates.
> x3 <- sample(1e5,replace=FALSE) ; x3[17] <- x3[length(x3)-17]
>
> I think the following function avoids these problems. It never converts
> the data to character, but uses match() on the original data to convert
> it to a set of unique integers that tabulate can handle.
>
> f2 <- function(x){
> ix<-match(x,x)
> tix<-tabulate(ix)
> retval<-logical(length(x))
> retval[which(tix!=1)]<-TRUE
> retval
> }
>
> Bill Dunlap
> TIBCO Software Inc - Spotfire Division
> wdunlap tibco.com
>
> > -----Original Message-----
> > From: r-help-bounces at r-project.org
> > [mailto:r-help-bounces at r-project.org] On Behalf Of Bert Gunter
> > Sent: Thursday, May 14, 2009 9:10 AM
> > To: 'Gabor Grothendieck'; 'christiaan pauw'
> > Cc: r-help at r-project.org
> > Subject: Re: [R] Duplicates and duplicated
> >
> > ... or, similar in character to Gabor's solution:
> >
> > tbl <- table(x)
> > (tbl[as.character(sort(x))]>1)+0
> >
> >
> > Bert Gunter
> > Nonclinical Biostatistics
> > 467-7374
> >
> > -----Original Message-----
> > From: r-help-bounces at r-project.org
> > [mailto:r-help-bounces at r-project.org] On
> > Behalf Of Gabor Grothendieck
> > Sent: Thursday, May 14, 2009 7:34 AM
> > To: christiaan pauw
> > Cc: r-help at r-project.org
> > Subject: Re: [R] Duplicates and duplicated
> >
> > Noting that:
> >
> > > ave(x, x, FUN = length) > 1
> > [1] FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
> >
> > try this:
> >
> > > rbind(x, dup = ave(x, x, FUN = length) > 1)
> >     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
> > x      1    2    3    4    4    5    6    7    8     9
> > dup    0    0    0    1    1    0    0    0    0     0
> >
> >
> > On Thu, May 14, 2009 at 2:16 AM, christiaan pauw
> > <cjpauw at gmail.com> wrote:
> > > Hi everybody.
> > > I want to identify not only the duplicate number but also the original
> > > number that has been duplicated.
> > > Example:
> > > x=c(1,2,3,4,4,5,6,7,8,9)
> > > y=duplicated(x)
> > > rbind(x,y)
> > >
> > > gives:
> > >   [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
> > > x    1    2    3    4    4    5    6    7    8     9
> > > y    0    0    0    0    1    0    0    0    0     0
> > >
> > > i.e. the second 4 [,5] is a duplicate.
> > >
> > > What I want is the first and second 4, i.e. [,4] and [,5], to be TRUE
> > >
> > >   [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
> > > x    1    2    3    4    4    5    6    7    8     9
> > > y    0    0    0    1    1    0    0    0    0     0
> > >
> > > I assume it can be done by sorting the vector and then checking if the
> > > next or the previous entry matches, using identical(). I am just unsure
> > > how to write such a loop, the logic of which (I think) is as follows:
> > >
> > > sort x
> > > for every value of x check if the next value is identical and return
> > > TRUE (or 1) if it is and FALSE (or 0) if it is not
> > > AND
> > > check if the previous value is identical and return TRUE (or 1) if it
> > > is and FALSE (or 0) if it is not
> > >
> > > Am I thinking correctly, and can someone help me write such a function?
> > >
> > > regards
> > > Christiaan
> > >
> > > [[alternative HTML version deleted]]
> > >
> > > ______________________________________________
> > > R-help at r-project.org mailing list
> > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > PLEASE do read the posting guide
> > http://www.R-project.org/posting-guide.html
> > > and provide commented, minimal, self-contained, reproducible code.
> > >
> >
>
>