[R] problem with duplicated function
Rolf Turner
r.turner at auckland.ac.nz
Mon May 25 00:35:56 CEST 2015
On 25/05/15 09:34, Curtis Burkhalter wrote:
> Hello everyone,
>
> I have two very large data frames (~1 million rows x 5 columns), of which
> two of the columns are lat/long coordinates. The names of the data frames
> are 'data07' and 'data08'. data08 has a few more sampling points than
> data07, so I want to subset data08 so that it has the same number of data
> points as data07, using the unique lat/long coordinates.
>
> Here are the associated data structures:
>
> *str(data07)*
> 'data.frame': 969109 obs. of 5 variables:
>  $ cell    : int 710228 715545 720690 720824 695611 700490 700626 705371 705507 710363 ...
>  $ prN     : int 288 276 286 304 258 257 264 272 286 316 ...
>  $ Location: Factor w/ 32 levels " ","Blacks_Fork",..: 24 24 24 24 24 24 24 24 24 24 ...
>  $ Xcor    : num -111 -111 -111 -111 -111 ...
>  $ Ycor    : num 41.7 41.7 41.7 41.7 41.8 ...
>
> *str(data08)*
> 'data.frame': 969810 obs. of 5 variables:
>  $ cell    : int 705528 710321 710456 715677 720762 720896 699953 700635 700771 705664 ...
>  $ prN     : int 293 281 299 278 276 266 282 255 287 280 ...
>  $ Location: Factor w/ 31 levels "Blacks_Fork",..: 23 23 23 23 23 23 23 23 23 23 ...
>  $ Xcor    : num -111 -111 -111 -111 -111 ...
>  $ Ycor    : num 41.8 41.7 41.7 41.7 41.7 ...
>
> I've tried using the following code to solve my problem:
>
> tt <- rbind(data07, data08)
>
> tt.dup <- duplicated(tt[,4:5]) # marks all duplicate rows in data08 from
> #   the last 2 cols, which correspond to the lat/long
Running this on the 10-row samples you give below, I get tt.dup to be:
> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
> [13] FALSE TRUE TRUE TRUE FALSE TRUE FALSE TRUE
>
> tt.dup <- tt.dup[-seq_len(nrow(data07))] # remove all data07 entries (first n)
This just throws away the first 10 entries of tt.dup, leaving
> [1] FALSE TRUE FALSE TRUE TRUE TRUE FALSE TRUE FALSE TRUE
>
> test=ddata08[tt.dup, ] # index only TRUE/duplicated elements from data08
       ^
This leaves the c(2,4,5,6,8,10) entries of data08.
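
For reference, a minimal sketch that reproduces the three steps on the 10-row samples quoted further down. Only the Xcor/Ycor values are retyped from those samples; building the objects with data.frame() and selecting the columns by name are assumptions made for illustration, not the original objects.

```r
# Coordinates retyped from the 10-row samples in the question (illustration only).
data07 <- data.frame(
  Xcor = c(rep(-111.044, 4), rep(-111.043, 6)),
  Ycor = c(41.7403, 41.7245, 41.7131, 41.7109, 41.7766,
           41.7653, 41.7630, 41.7517, 41.7495, 41.7381)
)
data08 <- data.frame(
  Xcor = c(rep(-111.044, 6), rep(-111.043, 4)),
  Ycor = c(41.7517, 41.7403, 41.7381, 41.7245, 41.7131,
           41.7109, 41.7767, 41.7653, 41.7631, 41.7495)
)

tt <- rbind(data07, data08)

# TRUE where a coordinate pair already occurred in an earlier row of tt
tt.dup <- duplicated(tt[, c("Xcor", "Ycor")])

# Drop the flags that belong to data07 (the first nrow(data07) entries)
tt.dup <- tt.dup[-seq_len(nrow(data07))]

# Rows of data08 whose coordinates also occur in data07
# (note 'data08', not 'ddata08' as typed in the question)
test <- data08[tt.dup, ]

which(tt.dup)   # 2 4 5 6 8 10
```
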
>
> When I run the code 'tt.dup' is FALSE for all entries, which I know isn't
> true.
Only 4 of the entries of tt.dup are FALSE; 6 are TRUE. I don't
understand why you think that they are all FALSE.
Perhaps your subsets do not accurately reflect the actual nature of your
data.
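
One possible way for this to happen (an assumption; nothing in the posted output confirms it): print() shows numbers to about 7 significant digits, whereas duplicated() compares the stored values, so coordinates that look identical when printed can still differ. A minimal sketch with made-up values follows; the offset of 1e-9 and the rounding to 4 decimal places are arbitrary choices for illustration.

```r
# Hypothetical coordinates: identical when printed, different as stored.
a <- data.frame(Xcor = -111.044, Ycor = 41.7403)
b <- data.frame(Xcor = -111.044, Ycor = 41.7403 + 1e-9)
tt <- rbind(a, b)

print(tt)   # both rows display as -111.044 41.7403

# Comparison on the stored values: the second row is not flagged
dup_raw <- duplicated(tt)                 # FALSE FALSE

# Comparison after rounding both columns to 4 decimal places
dup_rounded <- duplicated(round(tt, 4))   # FALSE TRUE
```

If something like this applies to the full data, comparing rounded coordinates (or the integer 'cell' identifiers, if they are comparable between years) may be more reliable than comparing the raw Xcor/Ycor values.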
cheers,
Rolf Turner
>
> Here's a small subset of the data so that you can see exactly where there
> are duplicates:
>
> data07[1:10,]
>           cell prN Location     Xcor    Ycor
> 710229 *710228 288     Sage -111.044 41.7403*
> 715546 *715545 276     Sage -111.044 41.7245*
> 720691 *720690 286     Sage -111.044 41.7131*
> 720825 *720824 304     Sage -111.044 41.7109*
> 695612  695611 258     Sage -111.043 41.7766
> 700491  700490 257     Sage -111.043 41.7653
> 700627  700626 264     Sage -111.043 41.7630
> 705372  705371 272     Sage -111.043 41.7517
> 705508  705507 286     Sage -111.043 41.7495
> 710364  710363 316     Sage -111.043 41.7381
>
> data08[1:10,]
>           cell prN Location     Xcor    Ycor
> 705529  705528 293     Sage -111.044 41.7517
> 710322 *710321 281     Sage -111.044 41.7403*
> 710457  710456 299     Sage -111.044 41.7381
> 715678 *715677 278     Sage -111.044 41.7245*
> 720763 *720762 276     Sage -111.044 41.7131*
> 720897 *720896 266     Sage -111.044 41.7109*
> 699954  699953 282     Sage -111.043 41.7767
> 700636  700635 255     Sage -111.043 41.7653
> 700772  700771 287     Sage -111.043 41.7631
> 705665  705664 280     Sage -111.043 41.7495
>
>
> If anyone has any suggestions as to where I might be going wrong I'd
> greatly appreciate it.
>
> Thank you
>
>
>
>
--
Technical Editor ANZJS
Department of Statistics
University of Auckland
Phone: +64-9-373-7599 ext. 88276
Home phone: +64-9-480-4619