[R] duplicated() and unique() problems

Tue Jun 8 13:57:51 CEST 2010

On Tuesday, June 8, 2010, christiaan pauw <cjpauw at gmail.com> wrote:
> Hi everybody
>
> I have found something (for me at least) strange with duplicated(). I will
> first provide a replicable example of a certain kind of behaviour that I
> find odd and then give a sample of unexpected results from my own data. I
> hope someone can help me understand this.
>
> Consider the following
>
> # this works as expected
>
> ex=sample(1:20, replace=TRUE)
>
> ex
>
> duplicated(ex)
>
> ex=sort(ex)
>
> ex
>
> duplicated(ex)
>
>
> # but why does duplicated() not work after order()?
>
> ex=sample(1:20, replace=TRUE)
>
> ex
>
> duplicated(ex)
>
> ex=order(ex)
>
> duplicated(ex)
>
> Why does duplicated() not work after order() has been applied, when it works
> fine after sort()? Is this an error, or is there something I don't
> understand?

The latter: order() returns indices into your vector, i.e. a
permutation that selects the values in sorted order, rather than the
values themselves. Every element of a permutation is unique by
definition, so duplicated() correctly returns all FALSE.
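
To make this concrete, here is a small sketch with a fixed vector. The
values, and the names sorted and idx, are made up for illustration and
are not from the original post.

```r
# Illustrative vector with repeated values
ex <- c(3, 1, 3, 2, 1)

sorted <- sort(ex)   # the values themselves, in increasing order
idx    <- order(ex)  # the positions that would put ex in sorted order

sorted               # 1 1 2 3 3
idx                  # 2 5 4 1 3

duplicated(sorted)   # FALSE TRUE FALSE FALSE TRUE
duplicated(idx)      # all FALSE: a permutation has no repeated entries

# Indexing by the permutation recovers the sorted values
identical(ex[idx], sorted)  # TRUE
```

So if the goal is a sorted copy to pass to duplicated(), use sort(ex) or
ex[order(ex)], not order(ex) on its own.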

>
> I have been getting very strange results from duplicated() and unique() in a
> dataset I am analysing. Here is a little sample of my real-life problem

presumably this is a data.frame...

>
>> str(Masechaba$PROPDESC)
>  Factor w/ 24545 levels "     06","   71Hemilton str",..: 14527 8043 16113
> 16054 13875 15780 12522 7771 14824 12314 ...
>> # Create an indicator of whether the PROPDESC is unique. Default FALSE
>> Masechaba$unique=FALSE
>> Masechaba$unique[which(is.na(unique(Masechaba$PROPDESC))==FALSE)]=TRUE

The statement above is in error. The indices you compute refer to
elements of unique(Masechaba$PROPDESC), which do not correspond to the
rows of Masechaba; the two have different lengths. Use duplicated()
instead.
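
A small sketch of what goes wrong, using a made-up factor x in place of
Masechaba$PROPDESC (the data and the names flag_wrong, first_seen and
occurs_once are illustrative assumptions). Since unique(x) has one
element per distinct value, which(!is.na(unique(x))) is simply 1, 2, ...
up to the number of distinct values when there are no NAs, so the
assignment marks that many leading rows TRUE whatever they contain.
That would also be consistent with the matching totals reported further
down, because the number of distinct values equals the number of
non-duplicated rows.

```r
# Hypothetical stand-in for Masechaba$PROPDESC
x <- factor(c("a", "b", "a", "c", "b", "d"))

length(x)          # 6
length(unique(x))  # 4

# The original approach: positions within unique(x), not rows of the data
flag_wrong <- rep(FALSE, length(x))
flag_wrong[which(!is.na(unique(x)))] <- TRUE
flag_wrong         # TRUE TRUE TRUE TRUE FALSE FALSE

# First occurrence of each value
first_seen <- !duplicated(x)
first_seen         # TRUE TRUE FALSE TRUE FALSE TRUE

# Values that occur exactly once in the whole vector
occurs_once <- !(duplicated(x) | duplicated(x, fromLast = TRUE))
occurs_once        # FALSE FALSE FALSE TRUE FALSE TRUE
```

Which of first_seen or occurs_once you want depends on what "unique"
should mean for your analysis; duplicated() by itself only flags the
second and later occurrences.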

>> # Check if something happened
>> length(which(Masechaba$unique==TRUE))
> [1] 2174
>> length(which(Masechaba$unique==FALSE))
> [1] 476
>> Masechaba$duplicate=FALSE
>> Masechaba$duplicate[which(duplicated(Masechaba$PROPDESC)==TRUE)]=TRUE

equivalent to
Masechaba$duplicate <-  duplicated(Masechaba$PROPDESC)
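
As a quick check of that equivalence, here is a sketch on a tiny
made-up data frame (d, dup_long and dup_short are hypothetical names
used only for this example):

```r
# Hypothetical data standing in for Masechaba
d <- data.frame(PROPDESC = c("a", "b", "a", "c", "b"))

# Long form, as in the original post
d$dup_long <- FALSE
d$dup_long[which(duplicated(d$PROPDESC) == TRUE)] <- TRUE

# Direct form: duplicated() already returns one logical per row
d$dup_short <- duplicated(d$PROPDESC)

identical(d$dup_long, d$dup_short)  # TRUE
sum(d$dup_short)                    # 2
```

Likewise, sum(d$dup_short) and sum(!d$dup_short) give the same counts as
length(which(... == TRUE)) and length(which(... == FALSE)).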

>> length(which(Masechaba$duplicate==TRUE))
> [1] 476
>> length(which(Masechaba$duplicate==FALSE))
> [1] 2174
>> # Looks OK so far
>> # Test on a known duplicate. I expect one to be true and one to be false
>> Masechaba[which(Masechaba$PROPDESC==2363),10:12]
>       PROPDESC unique duplicate
> 24874     2363   TRUE     FALSE
> 31280     2363   TRUE      TRUE
>
> # This is strange.  I expected that unique() and duplicated() would give the
> same results. The variable PROPDESC is clearly not unique in both cases.
> # The totals are the same but not the individual results
>> table(Masechaba$unique,Masechaba$duplicate)
>
>         FALSE TRUE
>   FALSE   342  134
>   TRUE   1832  342
>
> I don't understand this. Is there something I am missing?
>
> Best regards
> Christaan
>
>
> P.S
>> sessionInfo()
> R version 2.11.1 (2010-05-31)
> x86_64-apple-darwin9.8.0
>
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>
> attached base packages:
> [1] splines   stats     graphics  grDevices utils     datasets  methods
> base
>
> other attached packages:
> [1] plyr_0.1.9      maptools_0.7-34 lattice_0.18-8  foreign_0.8-40
>  Hmisc_3.8-0     survival_2.35-8 rgdal_0.6-26
> [8] sp_0.9-64
>
> loaded via a namespace (and not attached):
> [1] cluster_1.12.3 grid_2.11.1    tools_2.11.1
>

-- 
Felix Andrews / 安福立
Integrated Catchment Assessment and Management (iCAM) Centre
Fenner School of Environment and Society [Bldg 48a]
The Australian National University
Canberra ACT 0200 Australia
M: +61 410 400 963
T: + 61 2 6125 4670
E: felix.andrews at anu.edu.au
CRICOS Provider No. 00120C
-- 
http://www.neurofractal.org/felix/