[R] get latest dates for different people in a dataset

Sun Jan 25 21:38:45 CET 2015

See inline;

On 2015-01-25 20:27, William Dunlap wrote:
>  >> dLatestVisit <- dSorted[!duplicated(dSorted$__Name), ]
>  >
>  >I guess it is faster, but who knows?
>
> You can find out by making a function that generates datasets of
> various sizes and timing the suggested algorithms.  E.g.,
> makeData <-
> function(nPatients, aveVisitsPerPatient, uniqueNameDate = TRUE){
>      nrow <- trunc(nPatients * aveVisitsPerPatient)
>      patientNames <- paste0("P",seq_len(nPatients))
>      possibleDates <- as.Date(16001:17000, origin=as.Date("1970-01-01"))
>      possibleTemps <- seq(97, 103, by=0.1)
>      data <- data.frame(Name=sample(patientNames, replace=TRUE, size=nrow),
>                 CheckInDate=sample(possibleDates, replace=TRUE, size=nrow),
>                 Temp=sample(possibleTemps, replace=TRUE, size=nrow))
>      if (uniqueNameDate) {
>          data <- data[!duplicated(data[, c("Name", "CheckInDate")]), ]
>      }
>      data
> }
> funs <- list(
>      f1 = function(data) {
>          do.call(rbind, lapply(split(data, data$Name), function(x)
> x[order(x$CheckInDate),][nrow(x),]))
>      }, f2 = function (d)
>      {
>          isEndOfRun <- function(x) c(x[-1] != x[-length(x)], TRUE)
>          dSorted <- d[order(d$Name, d$CheckInDate), ]
>          dSorted[isEndOfRun(dSorted$Name), ]
>      }, f3 = function (d)
>      {
>          # is the following how you did reverse sort on date (& fwd on
> name)?

Yes; in fact I do this all the time in my applications (survival 
analysis), where I have several records for each individual.

Göran

>          #  Too bad that order's decreasing arg is not vectorized
>          dSorted <- d[order(d$Name, -as.numeric(d$CheckInDate)), ]
>          dSorted[!duplicated(dSorted$Name), ]
>      }, f4 = function(dta)
>      {
>          dta %>% group_by(Name)  %>% filter(CheckInDate==max(CheckInDate))
>      })
>
> D <- makeData(nPatients=35000, aveVisitsPerPatient=3.7) # c. 129000 visits
> library(dplyr)
> Z <- lapply(funs, function(fun){
>      time <- system.time( result <- fun(D) ) ; list(time=time,
> result=result) })
>
> sapply(Z, function(x)x$time)
> #               f1   f2   f3   f4
> #user.self  461.25 0.47 0.36 3.01
> #sys.self     1.20 0.00 0.00 0.01
> #elapsed    472.33 0.47 0.39 3.03
> #user.child     NA   NA   NA   NA
> #sys.child      NA   NA   NA   NA
>
> # duplicated is a bit better than diff, dplyr rather slower, rbind much
> slower.
>
> equivResults <- function(a, b) {
>     # results have different classes and different orders, so only check
> size and contents
>      identical(dim(a),dim(b)) && all(a[order(a$Name),]==b[order(b$Name),])
> }
> sapply(Z[-1], function(x)equivResults(x$result, Z[[1]]$result))
> #  f2   f3   f4
> #TRUE TRUE TRUE
>
> Note that the various functions give different results if any patient comes
> in twice on the same day.  f4 includes both visits in the ouput, the other
> include either the first or last (as ordered in the original file).
>
> Bill Dunlap
> TIBCO Software
> wdunlap tibco.com <http://tibco.com>
>
> On Sun, Jan 25, 2015 at 1:01 AM, Göran Broström <goran.brostrom at umu.se
> <mailto:goran.brostrom at umu.se>> wrote:
>
>     On 2015-01-24 01:14, William Dunlap wrote:
>
>         Here is one way.  Sort the data.frame, first by Name then break
>         ties with
>         CheckInDate.
>         Then choose the rows that are the last in a run of identical
>         Name values.
>
>
>     I do it by sorting by the reverse order of CheckinDate (last date
>     first) within Name, then
>
>      > dLatestVisit <- dSorted[!duplicated(dSorted$__Name), ]
>
>     I guess it is faster, but who knows?
>
>     Göran
>
>
>             txt <- "Name    CheckInDate      Temp
>
>         + John      1/3/2014              97
>         + Mary     1/3/2014              98.1
>         + Sam       1/4/2014              97.5
>         + John      1/4/2014              99"
>
>             d <- read.table(header=TRUE,
>
>         colClasses=c("character","__character","numeric"), text=txt)
>
>             d$CheckInDate <- as.Date(d$CheckInDate, as.Date,
>             format="%d/%m/%Y")
>             isEndOfRun <- function(x) c(x[-1] != x[-length(x)], TRUE)
>             dSorted <- d[order(d$Name, d$CheckInDate), ]
>             dLatestVisit <- dSorted[isEndOfRun(dSorted$__Name), ]
>             dLatestVisit
>
>             Name CheckInDate Temp
>         4 John 2014-04-01 99 <tel:2014-04-01%2099>.0
>         2 Mary 2014-03-01 98 <tel:2014-03-01%2098>.1
>         3  Sam 2014-04-01 97 <tel:2014-04-01%2097>.5
>
>
>         Bill Dunlap
>         TIBCO Software
>         wdunlap tibco.com <http://tibco.com>
>
>
>         On Fri, Jan 23, 2015 at 3:43 PM, Tan, Richard <RTan at panagora.com
>         <mailto:RTan at panagora.com>> wrote:
>
>             Hi,
>
>             Can someone help for a R question?
>
>             I have a data set like:
>
>             Name    CheckInDate      Temp
>             John      1/3/2014              97
>             Mary     1/3/2014              98.1
>             Sam       1/4/2014              97.5
>             John      1/4/2014              99
>
>             I'd like to return a dataset that for each Name, get the row
>             that is the
>             latest CheckInDate for that person.  For the example above
>             it would be
>
>             Name    CheckInDate      Temp
>             John      1/4/2014              99
>             Mary     1/3/2014              98.1
>             Sam       1/4/2014              97.5
>
>
>             Thank you for your help!
>
>             Richard
>
>
>                       [[alternative HTML version deleted]]
>
>             ________________________________________________
>             R-help at r-project.org <mailto:R-help at r-project.org> mailing
>             list -- To UNSUBSCRIBE and more, see
>             https://stat.ethz.ch/mailman/__listinfo/r-help
>             <https://stat.ethz.ch/mailman/listinfo/r-help>
>             PLEASE do read the posting guide
>             http://www.R-project.org/__posting-guide.html
>             <http://www.R-project.org/posting-guide.html>
>             and provide commented, minimal, self-contained, reproducible
>             code.
>
>
>                  [[alternative HTML version deleted]]
>
>         ________________________________________________
>         R-help at r-project.org <mailto:R-help at r-project.org> mailing list
>         -- To UNSUBSCRIBE and more, see
>         https://stat.ethz.ch/mailman/__listinfo/r-help
>         <https://stat.ethz.ch/mailman/listinfo/r-help>
>         PLEASE do read the posting guide
>         http://www.R-project.org/__posting-guide.html
>         <http://www.R-project.org/posting-guide.html>
>         and provide commented, minimal, self-contained, reproducible code.
>
>
>     ________________________________________________
>     R-help at r-project.org <mailto:R-help at r-project.org> mailing list --
>     To UNSUBSCRIBE and more, see
>     https://stat.ethz.ch/mailman/__listinfo/r-help
>     <https://stat.ethz.ch/mailman/listinfo/r-help>
>     PLEASE do read the posting guide
>     http://www.R-project.org/__posting-guide.html
>     <http://www.R-project.org/posting-guide.html>
>     and provide commented, minimal, self-contained, reproducible code.
>
>