[R] how to subset based on other row values and multiplicity
William Dunlap
wdunlap at tibco.com
Wed Jul 16 22:24:56 CEST 2014
> filter(any(c(abs(diff(as.Date(date))),NA)>31)& date == min(date))
Note that the 'date == min(date)' condition will produce superfluous output
rows when there are several readings on the initial date for a given
id/value pair. E.g.,
> dat1 <- data.frame(stringsAsFactors=FALSE, id=rep("A", 4), value=rep("x", 4), date=as.Date("2000-10-1")+c(1,1,50,50))
> f2(dat1) # want 1 output row: A, x, 2000-10-2
Source: local data frame [2 x 3]
Groups: id, value
  id value       date
1  A     x 2000-10-02
2  A     x 2000-10-02
where f2 is your code wrapped up in a function (to make testing and use easier):
f2 <- function (data)
{
    library(dplyr)
    data %>% group_by(id, value) %>% arrange(date = as.Date(date)) %>%
        filter(any(c(abs(diff(as.Date(date))), NA) > 31) & date == min(date))
}
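One way to avoid the duplicate rows, offered as a sketch rather than a tested fix: collapse each qualifying id/value group to its earliest date with summarise() instead of filtering on date == min(date). The function name f3 is hypothetical. The test on the span between the earliest and latest reading (31 days or more, following the wording of the original question) is an assumption about what 'confirmed' should mean; it is not the same as the consecutive-gap test used in f2.

```r
library(dplyr)

# f3 is a hypothetical name; a sketch, not code from the thread.
f3 <- function(data) {
  data %>%
    mutate(date = as.Date(date)) %>%
    group_by(id, value) %>%
    # Assumption: a value is 'confirmed' when its earliest and latest
    # readings are at least 31 days apart.
    filter(as.numeric(max(date) - min(date)) >= 31) %>%
    # One row per surviving id/value group, holding the earliest date.
    summarise(date = min(date)) %>%
    ungroup()
}

dat1 <- data.frame(stringsAsFactors = FALSE, id = rep("A", 4),
                   value = rep("x", 4),
                   date = as.Date("2000-10-01") + c(1, 1, 50, 50))
f3(dat1)  # one row: A, x, 2000-10-02
```

Because summarise() returns one row per group, ties on the first date cannot produce extra rows. Columns other than id, value and date are dropped, so this only fits data shaped like the example.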
Bill Dunlap
TIBCO Software
wdunlap tibco.com
On Wed, Jul 16, 2014 at 7:49 AM, arun <smartpink111 at yahoo.com> wrote:
> Hi,
> If `dat` is the dataset
>
> library(dplyr)
> dat %>%
>   group_by(id, value) %>%
>   arrange(date = as.Date(date)) %>%
>   filter(any(c(abs(diff(as.Date(date))), NA) > 31) & date == min(date))
> #Source: local data frame [3 x 3]
> #Groups: id, value
> #
> # id date value
> #1 a 2000-01-01 x
> #2 c 2000-09-10 y
> #3 c 2000-10-11 z
> A.K.
>
>
>
>
> On Wednesday, July 16, 2014 9:10 AM, Williams Scott <Scott.Williams at petermac.org> wrote:
> Hi R experts,
>
> I have a dataset as sampled below. Values are only regarded as 'confirmed'
> in an individual ('id') if they occur more than once at least 30 days apart.
>
>
> id date value
> a 2000-01-01 x
> a 2000-03-01 x
> b 2000-11-11 w
> c 2000-11-11 y
> c 2000-10-01 y
> c 2000-09-10 y
> c 2000-12-12 z
> c 2000-10-11 z
> d 2000-11-11 w
> d 2000-11-10 w
>
>
> I wish to subset the data to retain rows where the value for the
> individual is confirmed more than 30 days apart. So, after deleting all
> rows with just one occurrence of id and value, the rest would be the
> earliest occurrence of each value in each case id, provided 31 or more
> days exist between the dates. If >1 value is present per id, each value
> level needs to be assessed independently. This example would then reduce
> to:
>
>
> id date value
> a 2000-01-01 x
> c 2000-09-10 y
> c 2000-10-11 z
>
>
>
> I can do this via some crude loops and subsetting, but I am looking for as
> much efficiency as possible
> as the dataset has around 50 million rows to assess. Any suggestions
> welcomed.
>
> Thanks in advance
>
> Scott Williams MD
> Melbourne, Australia
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.