[R] how to subset based on other row values and multiplicity

Wed Jul 16 22:24:56 CEST 2014

> filter(any(c(abs(diff(as.Date(date))),NA)>31)& date == min(date))

Note that the 'date == min(date)' will cause superfluous output rows
when there are several readings on initial date for a given id/value
pair.  E.g.,

> dat1 <- data.frame(stringsAsFactors=FALSE, id=rep("A", 4), value=rep("x", 4), date=as.Date("2000-10-1")+c(1,1,50,50))
> f2(dat1) # want 1 output row: A, x, 2000-10-2
Source: local data frame [2 x 3]
Groups: id, value

  id value       date
1  A     x 2000-10-02
2  A     x 2000-10-02

where f2 is your code wrapped up in a function (to make testing and use easier)

f2 <- function (data)
{
    library(dplyr)
    data %>% group_by(id, value) %>% arrange(date = as.Date(date)) %>%
        filter(any(c(abs(diff(as.Date(date))), NA) > 31) & date == min(date))
}

Bill Dunlap
TIBCO Software
wdunlap tibco.com

On Wed, Jul 16, 2014 at 7:49 AM, arun <smartpink111 at yahoo.com> wrote:
> Hi,
> If `dat` is the dataset
>
> library(dplyr)
> dat%>%
> group_by(id,value)%>%
>
> arrange(date=as.Date(date))%>%
> filter(any(c(abs(diff(as.Date(date))),NA)>31)& date == min(date))
> #Source: local data frame [3 x 3]
> #Groups: id, value
> #
> #  id       date value
> #1  a 2000-01-01     x
> #2  c 2000-09-10     y
> #3  c 2000-10-11     z
> A.K.
>
>
>
>
> On Wednesday, July 16, 2014 9:10 AM, Williams Scott <Scott.Williams at petermac.org> wrote:
> Hi R experts,
>
> I have a dataset as sampled below. Values are only regarded as Œconfirmed¹
> in an individual (Œid¹) if they occur
> more than once at least 30 days apart.
>
>
> id   date value
> a    2000-01-01 x
> a    2000-03-01 x
> b    2000-11-11 w
> c    2000-11-11 y
> c    2000-10-01 y
> c    2000-09-10 y
> c    2000-12-12 z
> c    2000-10-11 z
> d    2000-11-11 w
> d    2000-11-10 w
>
>
> I wish to subset the data to retain rows where the value for the
> individual is confirmed more than 30 days apart. So, after deleting all
> rows with just one occurrence of id and value, the rest would be the
> earliest occurrence of each value in each case id, provided 31 or more
> days exist between the dates. If >1 value is present per id, each value
> level needs to be assessed independently. This example would then reduce
> to:
>
>
> id   date           value
> a    2000-01-01 x
> c    2000-09-10 y
> c    2000-10-11 z
>
>
>
> I can do this via some crude loops and subsetting, but I am looking for as
> much efficiency as possible
> as the dataset has around 50 million rows to assess. Any suggestions
> welcomed.
>
> Thanks in advance
>
> Scott Williams MD
> Melbourne, Australia
>
>
>
> This email (including any attachments or links) may contain
> confidential and/or legally privileged information and is
> intended only to be read or used by the addressee.  If you
> are not the intended addressee, any use, distribution,
> disclosure or copying of this email is strictly
> prohibited.
> Confidentiality and legal privilege attached to this email
> (including any attachments) are not waived or lost by
> reason of its mistaken delivery to you.
> If you have received this email in error, please delete it
> and notify us immediately by telephone or email.  Peter
> MacCallum Cancer Centre provides no guarantee that this
> transmission is free of virus or that it has not been
> intercepted or altered and will not be liable for any delay
> in its receipt.
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.