[R] Must be a better way to collate sequenced data
Petr PIKAL
petr.pikal at precheza.cz
Mon Jun 8 15:14:44 CEST 2009
Hi
"Burke, Robin" <rburke at cs.depaul.edu> napsal dne 08.06.2009 11:28:46:
> Thanks for the quick response. Sorry for being unclear with my example.
> Here is something more concrete:
>
> user <- c(1, 2, 1, 2, 3, 1, 3, 4, 2, 3, 4, 1);
> time <- c(100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 1100, 1200);
> userCount <- c(1, 1, 2, 2, 1, 3, 2, 1, 3, 3, 2, 4);
>
> period <- 100
>
> utime.data <- data.frame(USER=user, TIME=time, USER_COUNT=userCount);
>
> The answer
>
> >utime.rcount
>   TIME TIME      PERC
> 1    0    0 1.4166667
> 2    1    4 1.4166667
> 3    3    9 0.9166667
> 4    6    6 0.2500000
Only partial
This code should do what you want; however, I did not check its speed.
utime.data$TIME <- utime.data$TIME %/% period              # time stamps to periods (days)
lll <- split(utime.data, utime.data$USER)                  # one data frame per user
utime.tstart <- lapply(lll, function(x) x[1,2])            # each user's first period
utime.tstart <- as.numeric(unlist(utime.tstart))
utime.userMax <- aggregate(utime.data["USER_COUNT"], utime.data["USER"], max)  # profile size per user
for (i in 1:length(utime.tstart)) lll[[i]]["TIME"] <- lll[[i]]["TIME"] - utime.tstart[i]   # time since user's start
for (i in 1:length(utime.tstart)) lll[[i]]["USER_COUNT"] <- 1/utime.userMax[i,2]           # each row's share of the profile
augdata <- do.call(rbind, lll)[,2:3]                       # recombine; keep TIME and the share column
utime.rcount <- aggregate(augdata, augdata["TIME"], sum)   # sum shares per relative time
However, it can probably be improved further.
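For instance, the explicit loops could probably be replaced by ave() - a rough, untested sketch, starting again from the original utime.data and the same column names as above:

utime.data$TIME <- utime.data$TIME %/% period                                   # time stamps to periods
rel.time <- with(utime.data, ave(TIME, USER, FUN = function(x) x - x[1]))       # periods since each user's first row
perc <- with(utime.data, 1 / ave(USER_COUNT, USER, FUN = max))                  # each row's share of its user's profile
utime.rcount <- aggregate(data.frame(PERC = perc), list(TIME = rel.time), sum)  # sum shares per relative period

It should give the same result as the loop version, but I have not benchmarked it.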
Regards
Petr
>
> I'm investigating the plyr package. I think splitting by users and
> re-merging may do the trick, providing I can re-merge in order of the
> transformed time value. That would avoid the costly sort operation in
> aggregate.
>
> Robin Burke
> Associate Professor
> School of Computer Science, Telecommunications, and
> Information Systems
> DePaul University
> (currently on leave at University College Dublin)
>
> http://josquin.cti.depaul.edu/~rburke/
>
> "The universe is made of stories, not of atoms" - Muriel Rukeyser
>
>
>
> -----Original Message-----
> From: Petr PIKAL [mailto:petr.pikal at precheza.cz]
> Sent: Monday, June 08, 2009 8:36 AM
> To: Burke, Robin
> Cc: r-help at r-project.org
> Subject: Odp: [R] Must be a better way to collate sequenced data
>
> Hi
>
> Nobody has your data, so your code is not reproducible. Here are just a
> few comments:
>
> augdata <<- as.data.frame(cbind(utime.atimes, utime.aperc))
>
> data.frame(utime.atimes, utime.aperc) is enough. cbinding is rather
> dangerous, as it produces a matrix, which can hold only one type of
> values.
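To illustrate that point with made-up values:

cbind(1:3, c("a", "b", "c"))                 # everything is coerced to one type (character matrix)
data.frame(x = 1:3, y = c("a", "b", "c"))    # columns keep their own types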
>
> I am a little bit puzzled by your example.
>
> u.profile<-c(50,20,10)
> u.days<-c(1,2,3)
> proc.prof<-u.profile/sum(u.profile)
> data.frame(u.days, proc.prof)
>   u.days proc.prof
> 1      1     0.625
> 2      2     0.250
> 3      3     0.125
>
> OTOH you speak about normalization by the max value:
>
> proc.prof<-u.profile/max(u.profile)
> data.frame(u.days, proc.prof)
>   u.days proc.prof
> 1      1       1.0
> 2      2       0.4
> 3      3       0.2
>
> Some suggestions that come to my mind:
>
> 1. Convert time.stamp to a POSIX class
> 2. Split your data by user:
>    mylist <- split(data, users)
> 3. Transform your data with lapply(mylist, <desired transformation>)
> 4. Perform the aggregation by days for each part of the list
> 5. Reprocess the list back into a data frame
>
> Maybe some functions from the plyr or doBy packages could help you.
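Something along those lines with plyr might look like this - a rough, untested sketch (ddply and summarise are from plyr; utime.data and period are as in the example above):

library(plyr)
aug <- ddply(utime.data, "USER", function(d)
    data.frame(TIME = d$TIME %/% period - d$TIME[1] %/% period,   # periods since the user's first transaction
               PERC = 1 / max(d$USER_COUNT)))                     # each row's share of that user's profile
ddply(aug, "TIME", summarise, PERC = sum(PERC))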
>
> Regards
> Petr
>
>
>
>
r-help-bounces at r-project.org wrote on 07.06.2009 23:55:00:
>
> > I have data that looks like this
> >
> > time_stamp (seconds) user_id
> >
> > The data is (partially) ordered by time - in that sometimes transactions
> > occur at the same timestamp. The output I want is collated by transaction
> > time on a per-user basis, normalized by the maximum number of transactions
> > per user, and aggregated over each day. So, if the users have 50
> > transactions in the first day, 20 transactions on the second day, and 10
> > transactions on the third day, the output would be as follows, if each
> > transaction represents 0.01 (i.e. 1%) of each user's total profile. (In
> > reality, they all have different profile lengths, so a transaction
> > represents a different percentage for each user.)
> >
> > time_since_first_transaction (days)   percent_of_profile
> >                                    1                 0.50
> >                                    2                 0.20
> >                                    3                 0.10
> >
> > I have the following code that computes the right answer, but it is
> > really inefficient, so I'm sure that I'm doing something wrong. Really
> > inefficient means > 30 minutes for a 100 k item data frame on a 2.2 GHz
> > machine, and my 1-million data set has never finished. I'm no stranger to
> > functional programming (Lisp programmer) but I can't figure out a way to
> > subtract the first timestamp for user A from all of the other timestamps
> > for user A without either (a) building a separate table of "first entries
> > for each user", which I do here, or (b) re-computing the initial entry for
> > each user with every row, which is what I did before and is even more
> > inefficient. Another killer operation seems to be the aggregate step on
> > the last line, which I use to collate the data by days. It seems very
> > slow, but I don't know any other way to do this. I realize that I am
> > living proof that one can program in C no matter what language one uses -
> > so I would appreciate any enlightenment on offer. If there's no better
> > way, I'll pre-process everything in Perl, but I'd rather learn the "R" way
> > to do things like this. Thanks.
> >
> > # Build table of times
> > utime.times <<- utime.data["TIME"] %/% period;
> > utime.tstart <<- vector("numeric", length=max(utime.data["USER"]));
> > for (i in 1:nrow(utime.data))
> > {
> >     if (as.numeric(utime.data[i, "USER_COUNT"])==1)
> >     {
> >         day <- utime.times[i, "TIME"];
> >         user <- utime.data[i, "USER"];
> >         utime.tstart[user] <<- day;
> >     }
> > }
> >
> > # Build table of maximum profile sizes
> > utime.userMax <<- aggregate(utime.data["USER_COUNT"], utime.data["USER"], max);
> >
> > utime.atimes <<- vector("numeric", length=nrow(utime.data));
> > utime.aperc <<- vector("numeric", length=nrow(utime.data));
> > augdata <<- as.data.frame(cbind(utime.atimes, utime.aperc));
> > names(augdata) <<- c("TIME", "PERC");
> > for (i in 1:nrow(utime.data))
> > {
> >     # adjust time according to user start time
> >     augdata[i, "TIME"] <<- utime.times[i,"TIME"] - utime.tstart[utime.data[i,"USER"]];
> >     # look up maximum user count
> >     umax <- subset(utime.userMax, USER==as.numeric(utime.data[i, "USER"]))["USER_COUNT"];
> >     augdata[i, "PERC"] <<- 1.0/umax;
> > }
> >
> > utime.rcount <<- aggregate(augdata, augdata["TIME"], sum);
> > ....
> >
> >
> > Robin Burke
> > Associate Professor
> > School of Computer Science, Telecommunications, and
> > Information Systems
> > DePaul University
> > (currently on leave at University College Dublin)
> >
> > http://josquin.cti.depaul.edu/~rburke/
> >
> > "The universe is made of stories, not of atoms" - Muriel Rukeyser
> >
> >
> >
> > ______________________________________________
> > R-help at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
>