[R] get latest dates for different people in a dataset
Göran Broström
goran.brostrom at umu.se
Sun Jan 25 21:38:45 CET 2015
See inline;
On 2015-01-25 20:27, William Dunlap wrote:
> >> dLatestVisit <- dSorted[!duplicated(dSorted$__Name), ]
> >
> >I guess it is faster, but who knows?
>
> You can find out by making a function that generates datasets of
> various sizes and timing the suggested algorithms. E.g.,
> makeData <-
> function(nPatients, aveVisitsPerPatient, uniqueNameDate = TRUE){
> nrow <- trunc(nPatients * aveVisitsPerPatient)
> patientNames <- paste0("P",seq_len(nPatients))
> possibleDates <- as.Date(16001:17000, origin=as.Date("1970-01-01"))
> possibleTemps <- seq(97, 103, by=0.1)
> data <- data.frame(Name=sample(patientNames, replace=TRUE, size=nrow),
> CheckInDate=sample(possibleDates, replace=TRUE, size=nrow),
> Temp=sample(possibleTemps, replace=TRUE, size=nrow))
> if (uniqueNameDate) {
> data <- data[!duplicated(data[, c("Name", "CheckInDate")]), ]
> }
> data
> }
> funs <- list(
> f1 = function(data) {
> do.call(rbind, lapply(split(data, data$Name), function(x)
> x[order(x$CheckInDate),][nrow(x),]))
> }, f2 = function (d)
> {
> isEndOfRun <- function(x) c(x[-1] != x[-length(x)], TRUE)
> dSorted <- d[order(d$Name, d$CheckInDate), ]
> dSorted[isEndOfRun(dSorted$Name), ]
> }, f3 = function (d)
> {
> # is the following how you did reverse sort on date (& fwd on
> name)?
Yes; in fact I do this all the time in my applications (survival
analysis), where I have several records for each individual.
Göran
> # Too bad that order's decreasing arg is not vectorized
> dSorted <- d[order(d$Name, -as.numeric(d$CheckInDate)), ]
> dSorted[!duplicated(dSorted$Name), ]
> }, f4 = function(dta)
> {
> dta %>% group_by(Name) %>% filter(CheckInDate==max(CheckInDate))
> })
>
> D <- makeData(nPatients=35000, aveVisitsPerPatient=3.7) # c. 129000 visits
> library(dplyr)
> Z <- lapply(funs, function(fun){
> time <- system.time( result <- fun(D) ) ; list(time=time,
> result=result) })
>
> sapply(Z, function(x)x$time)
> # f1 f2 f3 f4
> #user.self 461.25 0.47 0.36 3.01
> #sys.self 1.20 0.00 0.00 0.01
> #elapsed 472.33 0.47 0.39 3.03
> #user.child NA NA NA NA
> #sys.child NA NA NA NA
>
> # duplicated is a bit better than diff, dplyr rather slower, rbind much
> slower.
>
> equivResults <- function(a, b) {
> # results have different classes and different orders, so only check
> size and contents
> identical(dim(a),dim(b)) && all(a[order(a$Name),]==b[order(b$Name),])
> }
> sapply(Z[-1], function(x)equivResults(x$result, Z[[1]]$result))
> # f2 f3 f4
> #TRUE TRUE TRUE
>
> Note that the various functions give different results if any patient comes
> in twice on the same day. f4 includes both visits in the ouput, the other
> include either the first or last (as ordered in the original file).
>
> Bill Dunlap
> TIBCO Software
> wdunlap tibco.com <http://tibco.com>
>
> On Sun, Jan 25, 2015 at 1:01 AM, Göran Broström <goran.brostrom at umu.se
> <mailto:goran.brostrom at umu.se>> wrote:
>
> On 2015-01-24 01:14, William Dunlap wrote:
>
> Here is one way. Sort the data.frame, first by Name then break
> ties with
> CheckInDate.
> Then choose the rows that are the last in a run of identical
> Name values.
>
>
> I do it by sorting by the reverse order of CheckinDate (last date
> first) within Name, then
>
> > dLatestVisit <- dSorted[!duplicated(dSorted$__Name), ]
>
> I guess it is faster, but who knows?
>
> Göran
>
>
> txt <- "Name CheckInDate Temp
>
> + John 1/3/2014 97
> + Mary 1/3/2014 98.1
> + Sam 1/4/2014 97.5
> + John 1/4/2014 99"
>
> d <- read.table(header=TRUE,
>
> colClasses=c("character","__character","numeric"), text=txt)
>
> d$CheckInDate <- as.Date(d$CheckInDate, as.Date,
> format="%d/%m/%Y")
> isEndOfRun <- function(x) c(x[-1] != x[-length(x)], TRUE)
> dSorted <- d[order(d$Name, d$CheckInDate), ]
> dLatestVisit <- dSorted[isEndOfRun(dSorted$__Name), ]
> dLatestVisit
>
> Name CheckInDate Temp
> 4 John 2014-04-01 99 <tel:2014-04-01%2099>.0
> 2 Mary 2014-03-01 98 <tel:2014-03-01%2098>.1
> 3 Sam 2014-04-01 97 <tel:2014-04-01%2097>.5
>
>
> Bill Dunlap
> TIBCO Software
> wdunlap tibco.com <http://tibco.com>
>
>
> On Fri, Jan 23, 2015 at 3:43 PM, Tan, Richard <RTan at panagora.com
> <mailto:RTan at panagora.com>> wrote:
>
> Hi,
>
> Can someone help for a R question?
>
> I have a data set like:
>
> Name CheckInDate Temp
> John 1/3/2014 97
> Mary 1/3/2014 98.1
> Sam 1/4/2014 97.5
> John 1/4/2014 99
>
> I'd like to return a dataset that for each Name, get the row
> that is the
> latest CheckInDate for that person. For the example above
> it would be
>
> Name CheckInDate Temp
> John 1/4/2014 99
> Mary 1/3/2014 98.1
> Sam 1/4/2014 97.5
>
>
> Thank you for your help!
>
> Richard
>
>
> [[alternative HTML version deleted]]
>
> ________________________________________________
> R-help at r-project.org <mailto:R-help at r-project.org> mailing
> list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/__listinfo/r-help
> <https://stat.ethz.ch/mailman/listinfo/r-help>
> PLEASE do read the posting guide
> http://www.R-project.org/__posting-guide.html
> <http://www.R-project.org/posting-guide.html>
> and provide commented, minimal, self-contained, reproducible
> code.
>
>
> [[alternative HTML version deleted]]
>
> ________________________________________________
> R-help at r-project.org <mailto:R-help at r-project.org> mailing list
> -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/__listinfo/r-help
> <https://stat.ethz.ch/mailman/listinfo/r-help>
> PLEASE do read the posting guide
> http://www.R-project.org/__posting-guide.html
> <http://www.R-project.org/posting-guide.html>
> and provide commented, minimal, self-contained, reproducible code.
>
>
> ________________________________________________
> R-help at r-project.org <mailto:R-help at r-project.org> mailing list --
> To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/__listinfo/r-help
> <https://stat.ethz.ch/mailman/listinfo/r-help>
> PLEASE do read the posting guide
> http://www.R-project.org/__posting-guide.html
> <http://www.R-project.org/posting-guide.html>
> and provide commented, minimal, self-contained, reproducible code.
>
>
More information about the R-help
mailing list