[R] Tricky filtering
PIKAL Petr
petr@p|k@| @end|ng |rom prechez@@cz
Thu Oct 31 08:29:33 CET 2019
Hi.
Bert's questions should be clarified. But from your question I understand
that ANT01 and ANT02 are the only Stations you want to filter, and that you
want to keep all others regardless of condition. If this is true, I would
add a new column that has one value for ANT stations and a different value
for all others (if you have more than one). Then you could set a flag marking
the station with the largest mean N_records on each day. After that, you could
add back, for each day, the stations other than ANT that you want to keep.
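The steps above can be sketched in base R as follows. This is only an illustration under assumptions: the toy data frame is invented (it is not from the original post), the column names is_ant, ant_max and keep are hypothetical, and ANT stations are assumed to be recognizable by an "ANT" prefix in Station.

```r
# Toy data invented for illustration (not from the original post).
toy <- data.frame(
  Date      = as.Date(c("2019-03-30", "2019-03-30", "2019-03-30")),
  Station   = c("ANT01", "ANT02", "ETE01"),
  N_records = c(2289L, 913L, 5000L)
)

# One value for ANT stations, another for all other stations.
toy$is_ant <- grepl("^ANT", toy$Station)

# Largest record count per day, looking at ANT stations only
# (non-ANT rows are replaced by -Inf so they can never be the maximum).
toy$ant_max <- ave(ifelse(toy$is_ant, toy$N_records, -Inf), toy$Date, FUN = max)

# Keep the winning ANT station plus every non-ANT station.
toy$keep <- (!toy$is_ant) | (toy$N_records == toy$ant_max)

toy[toy$keep, ]
```

Restricting the maximum to ANT stations means that a non-ANT station with many records (ETE01 in this toy example) cannot displace the winning ANT station.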
I named your data test and changed it to a data frame, as I am not familiar
with tibbles.
The code is like this:
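# Mean N_records for each Date/Station combination
test$m <- ave(test$N_records, interaction(test$Date, test$Station), FUN=mean)
# 1 where the station has the largest mean on that date, 0 otherwise
test$flag <- ave(test$m, test$Date, FUN=function(x) max(x) == x)
# Also keep ETE01, which is not part of the ANT01/ANT02 comparison
test$keep <- test$flag + (test$Station == "ETE01")*1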
But you still need to think about the questions asked by Bert.
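For reference, here is a minimal, self-contained run of that code on the posted sample. The data frame is rebuilt by hand from the dput output as a plain data frame, and the final subsetting step (the object name kept) is an addition for illustration, not part of the code above.

```r
# Sample data copied from the dput output, as a plain data frame.
test <- data.frame(
  Date = as.Date(c(rep("2019-03-29", 4), rep("2019-03-30", 5))),
  Station = c("ANT01", "ANT01", "ANT02", "ANT02",
              "ANT01", "ANT01", "ANT02", "ANT02", "ETE01"),
  Antenna = c("1", "2", "1", "2", "1", "2", "1", "2", "AH0"),
  Media_power = c(108, 94, 220, 219, 204, 172, 88, 72, 87),
  N_records = c(1704L, 1219L, 3029L, 2711L, 2289L, 1477L, 913L, 1080L, 1L)
)

# Mean N_records for each Date/Station combination.
test$m <- ave(test$N_records, interaction(test$Date, test$Station), FUN = mean)

# 1 where the station has the largest mean on that date, 0 otherwise.
test$flag <- ave(test$m, test$Date, FUN = function(x) max(x) == x)

# ETE01 is kept regardless of the flag.
test$keep <- test$flag + (test$Station == "ETE01") * 1

# Hypothetical final step: drop the rows that are not marked.
kept <- test[test$keep > 0, ]
kept  # rows 3, 4, 5, 6 and 9 remain
```

On this sample the result agrees with the manually inserted Keep/Remove column. If the full data also carries the fish identifier (assumed here to be a column called Code), it could be passed as a further grouping variable in both ave() calls.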
Cheers
Petr
> -----Original Message-----
> From: R-help <r-help-bounces using r-project.org> On Behalf Of Bert Gunter
> Sent: Thursday, October 31, 2019 5:18 AM
> To: Cacique Samurai <caciquesamurai using gmail.com>
> Cc: R help <r-help using r-project.org>
> Subject: Re: [R] Tricky filtering
>
> Thanks for the nice dput example, but your specification confuses me.
> What if the 2 records with largest Mean_power are not the same as the two
> with largest N_records. Do you want to keep all four records? Or various
> combinations of this question that would keep 3 records. And will you
> always have two records on a date, or could you have just one? And if the 2
> records with largest Mean_power always also have the largest N_records,
> then you only need to choose the two with largest Mean_power and can
> ignore the N_records, right?
>
> Once you have answered these questions -- or someone else has a better
> understanding than I -- it should be easy. It will require a loop of one
> form or another, however, and therefore might take a while.
>
> Cheers,
> Bert
>
> Bert Gunter
>
> "The trouble with having an open mind is that people keep coming along
> and sticking things into it."
> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>
>
> On Wed, Oct 30, 2019 at 7:55 PM Cacique Samurai
> <caciquesamurai using gmail.com>
> wrote:
>
> > Hi all,
> >
> > I have fish telemetry data with more than 11 million lines. There are
> > some false records in the data that I have to eliminate. I can solve
> > this using a loop, but I think that dplyr::filter could be faster and
> > more elegant. I just can't figure out how to do it.
> >
> > At this moment, I have already summarized the raw data and have something
> > like this (dput at end of e-mail):
> >
> > Date        Station  Antenna  Mean_power  N_records  Action needed (manually inserted)
> > 29/03/2019  ANT01    1        108         1704       Remove
> > 29/03/2019  ANT01    2         94         1219       Remove
> > 29/03/2019  ANT02    1        220         3029       Keep
> > 29/03/2019  ANT02    2        219         2711       Keep
> > 30/03/2019  ANT01    1        204         2289       Keep
> > 30/03/2019  ANT01    2        172         1477       Keep
> > 30/03/2019  ANT02    1         88          913       Remove
> > 30/03/2019  ANT02    2         72         1080       Remove
> > 30/03/2019  ETE01    AH0       87            1       Keep
> >
> > The problem occurs between Stations ANT01 and ANT02. Within the same day,
> > I have to keep the pair of records that has the bigger Mean_power and
> > more N_records. In this example, I have to keep the records of Station
> > ANT02 on 29/03 and those of ANT01 and ETE01 on 30/03. If I had no
> > stations other than ANT01 and ANT02 on the same day, it would be a
> > simple question.
> >
> > I have to do this for each marked fish, which is identified by a Code,
> > omitted here for brevity.
> >
> > Thanks in advance,
> >
> > Raoni
> >
> >
> > structure(list(Date = structure(c(17984, 17984, 17984, 17984, 17985,
> > 17985, 17985, 17985, 17985), class = "Date"), Station =
> > c("ANT01","ANT01", "ANT02", "ANT02", "ANT01", "ANT01", "ANT02",
> > "ANT02","ETE01"), Antenna = c("1", "2", "1", "2", "1", "2", "1",
> > "2","AH0"), Media_power = c(108, 94, 220, 219, 204, 172, 88, 72, 87),
> > N_records = c(1704L, 1219L, 3029L, 2711L, 2289L, 1477L, 913L, 1080L,
> > 1L)), row.names = c(NA, -9L), class = c("grouped_df", "tbl_df", "tbl",
> > "data.frame"), groups = structure(list(Date = structure(c(17984,
> > 17984, 17985, 17985, 17985), class = "Date"), Station = c("ANT01",
> > "ANT02", "ANT01", "ANT02", "ETE01"), .rows = list(1:2, 3:4, 5:6, 7:8,
> > 9L)), row.names = c(NA, -5L), class = c("tbl_df", "tbl",
> > "data.frame"), .drop = TRUE))
> >
> > --
> > Raoni Rosa Rodrigues
> > Research Associate of Fish Transposition Center CTPeixes Universidade
> > Federal de Minas Gerais - UFMG Brasil rodrigues.raoni using gmail.com
> >
> > [[alternative HTML version deleted]]
> >
> > ______________________________________________
> > R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> > http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> >