[R] Can R replicate this data manipulation in SAS?
Gabor Grothendieck
ggrothendieck at gmail.com
Fri Apr 22 18:51:32 CEST 2011
On Fri, Apr 22, 2011 at 11:27 AM, Gabor Grothendieck
<ggrothendieck at gmail.com> wrote:
> Since this involves time series within each id group I thought it
> would be interesting to see if this could be formulated using
> zoo series. The approach is to read it in, and convert it to a long
> form by just stacking the start and stop times in a data frame
> and converting that to zoo using the concatenation of date and id as
> the time index. It makes use of the fact that
> - zoo can use non-standard "times" (here the concatenation of date and id)
> - read.zoo can split and aggregate and
> - the resulting object can be directly cumsum'd using cumsum.zoo .
>
> Finally we convert it back to the required form.
>
> This was mostly done by looking at the output rather than trying to
> follow the SAS code so some differences are possible. It does
> seem to give the same output in this case.
>
> Yet another approach might be to use model.matrix to create the
> NRTI/NNRTI/PI columns:
> model.matrix(~ drug - 1, long2) * long2$change
> but below we stick with zoo.
>
> Some of the code here could be transplanted into Ista's solution
> (which uses the reshape and plyr packages) to get a combination
> approach.
>
> Lines <- "1004 NRTI 07/24/95 01/05/99
> 1004 NRTI 11/20/95 12/10/95
> 1004 NRTI 01/10/96 01/05/99
> 1004 PI 05/09/96 11/16/97
> 1004 NRTI 06/01/96 02/01/97
> 1004 NRTI 07/01/96 03/01/97
> 9999 PI 01/02/03 .
> 9999 NNRTI 04/05/06 07/08/09"
>
> library(zoo)
>
> # need na.fill from development version of zoo
> source("http://r-forge.r-project.org/scm/viewvc.php/*checkout*/pkg/zoo/R/na.fill.R?revision=831&root=zoo")
>
> # read in data and reshape to long form
>
> DF <- read.table(textConnection(Lines), as.is = TRUE, na.strings = ".",
> col.names = c("id", "drug", "date", "date"), check.names = FALSE)
> long <- rbind(cbind(DF[-4], change = 1), cbind(DF[-3], change = -1))
>
> # convert to zoo. date.id, the concatenation of date & id, becomes index.
> # Then take cumulative sums of columns and append regimen.
>
> long2 <- with(long,
> data.frame(date.id = paste(as.Date(date, "%m/%d/%y"), id), drug, change))
> z <- read.zoo(long2, index = 1, split = "drug", FUN = identity, aggregate = sum)
>
> z <- cumsum(na.fill(z, fill = 0))
> z$regimen <- ave(1:nrow(z), sub(".* ", "", time(z)), FUN = seq_along)
>
> # convert z to data frame for final processing
>
> id <- sub(".* ", "", time(z))
> start_date = as.Date(sub(" .*", "", time(z)))
> d2 <- data.frame(id, start_date, coredata(z))
>
> # from each group of rows in same id form the desired columns.
> # f processes one such group.
>
There was a bug in f. The last part above should have been:
f <- function(x) with(x, data.frame(
id = head(id, -1),
start_date = head(start_date, -1),
stop_date = start_date[-1]-1,
head(cbind(regimen, NRTI, NNRTI, PI, all = NRTI + NNRTI + PI,
HAART = as.numeric((NRTI >= 3 & NNRTI==0 & PI==0) |
(NRTI >= 2 & (NNRTI >= 1 | PI >= 1)) |
(NRTI == 1 & NNRTI >= 1 & PI >= 1))), -1)))
result <- do.call(rbind, by(d2, id, f))
rownames(result) <- NULL
result
> result
id start_date stop_date regimen NRTI NNRTI PI all HAART
1 1004 1995-07-24 1995-11-19 1 1 0 0 1 0
2 1004 1995-11-20 1995-12-09 2 2 0 0 2 0
3 1004 1995-12-10 1996-01-09 3 1 0 0 1 0
4 1004 1996-01-10 1996-05-08 4 2 0 0 2 0
5 1004 1996-05-09 1996-05-31 5 2 0 1 3 1
6 1004 1996-06-01 1996-06-30 6 3 0 1 4 1
7 1004 1996-07-01 1997-01-31 7 4 0 1 5 1
8 1004 1997-02-01 1997-02-28 8 3 0 1 4 1
9 1004 1997-03-01 1997-11-15 9 2 0 1 3 1
10 1004 1997-11-16 1999-01-04 10 2 0 0 2 0
11 9999 2003-01-02 2006-04-04 1 0 0 1 1 0
12 9999 2006-04-05 2009-07-07 2 0 1 1 2 0
13 9999 2009-07-08 <NA> 3 0 0 1 1 0
--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
More information about the R-help
mailing list