[R] Aggregate with non-scalar function
jim holtman
jholtman at gmail.com
Wed Nov 7 21:35:16 CET 2007
Is this closer to what you would like?
> x <- textConnection(" hostName user sys idle date time
+ 10142 fred 0.4 0.5 98.0 2007-11-01 02:02:18
+ 16886 barney 0.5 0.2 94.6 2007-10-25 19:12:12
+ 8795 fred 0.0 0.1 99.8 2007-10-30 05:08:22
+ 5261 fred 0.1 0.2 99.7 2007-10-25 07:20:32
+ 12427 barney 0.1 0.2 93.2 2007-10-19 14:34:10
+ 18067 barney 0.1 0.2 99.4 2007-10-27 10:34:08
+ 973 fred 0.0 0.2 99.8 2007-10-19 08:24:22
+ 5426 fred 0.2 0.3 99.5 2007-10-25 12:50:33
+ 7067 fred 0.1 0.2 99.4 2007-10-27 19:32:27
+ 13159 barney 0.1 0.4 84.3 2007-10-20 14:58:11
+ 17481 barney 1.2 2.0 92.6 2007-10-26 15:02:11
+ 21632 barney 0.1 0.1 99.6 2007-11-01 09:24:09
+ 206 fred 19.4 4.8 53.7 2007-10-18 06:50:34
+ 18151 barney 0.1 0.2 94.9 2007-10-27 13:22:09
+ 10662 fred 0.1 0.2 99.6 2007-11-01 19:22:27
+ 10376 fred 0.0 0.2 99.7 2007-11-01 09:50:24
+ 3630 fred 43.7 7.0 33.0 2007-10-23 00:58:27
+ 1118 fred 0.6 0.4 98.9 2007-10-19 13:14:23
+ 5122 fred 0.1 0.2 99.6 2007-10-25 02:42:21
+ 22117 barney 0.0 0.2 99.4 2007-11-02 01:34:04")
> x.in <- read.table(x, header=TRUE, as.is=TRUE)
> x.in$hour <- sapply(strsplit(x.in$time, ":"), '[', 1) # pick off the hour
> x.by <- by(x.in, list(x.in$hour, x.in$hostName), function(.host){
+ data.frame(hostName=.host$hostName[1], hour=.host$hour[1],
+ user.mean=mean(.host$user),
+ sys.mean=mean(.host$sys),
+ idle.mean=mean(.host$idle),
+ user.max=max(.host$user),
+ sys.max=max(.host$sys),
+ idle.max=max(.host$idle))
+ })
> do.call('rbind', x.by)
hostName hour user.mean sys.mean idle.mean user.max sys.max idle.max
1 barney 01 0.00 0.20 99.40 0.0 0.2 99.4
2 barney 09 0.10 0.10 99.60 0.1 0.1 99.6
3 barney 10 0.10 0.20 99.40 0.1 0.2 99.4
4 barney 13 0.10 0.20 94.90 0.1 0.2 94.9
5 barney 14 0.10 0.30 88.75 0.1 0.4 93.2
6 barney 15 1.20 2.00 92.60 1.2 2.0 92.6
7 barney 19 0.50 0.20 94.60 0.5 0.2 94.6
8 fred 00 43.70 7.00 33.00 43.7 7.0 33.0
9 fred 02 0.25 0.35 98.80 0.4 0.5 99.6
10 fred 05 0.00 0.10 99.80 0.0 0.1 99.8
11 fred 06 19.40 4.80 53.70 19.4 4.8 53.7
12 fred 07 0.10 0.20 99.70 0.1 0.2 99.7
13 fred 08 0.00 0.20 99.80 0.0 0.2 99.8
14 fred 09 0.00 0.20 99.70 0.0 0.2 99.7
15 fred 12 0.20 0.30 99.50 0.2 0.3 99.5
16 fred 13 0.60 0.40 98.90 0.6 0.4 98.9
17 fred 19 0.10 0.20 99.50 0.1 0.2 99.6
>
On 11/7/07, Mike Nielsen <mr.blacksheep at gmail.com> wrote:
> R-Helpers,
>
> I'm sorry to have to ask this -- I've not used R very much in the last
> 8 or 10 months, and I've gotten rusty.
>
> I have the following (ff2 is a subset of a much, much larger dataset):
>
> > ff2
> hostName user sys idle obsTime
> 10142 fred 0.4 0.5 98.0 2007-11-01 02:02:18
> 16886 barney 0.5 0.2 94.6 2007-10-25 19:12:12
> 8795 fred 0.0 0.1 99.8 2007-10-30 05:08:22
> 5261 fred 0.1 0.2 99.7 2007-10-25 07:20:32
> 12427 barney 0.1 0.2 93.2 2007-10-19 14:34:10
> 18067 barney 0.1 0.2 99.4 2007-10-27 10:34:08
> 973 fred 0.0 0.2 99.8 2007-10-19 08:24:22
> 5426 fred 0.2 0.3 99.5 2007-10-25 12:50:33
> 7067 fred 0.1 0.2 99.4 2007-10-27 19:32:27
> 13159 barney 0.1 0.4 84.3 2007-10-20 14:58:11
> 17481 barney 1.2 2.0 92.6 2007-10-26 15:02:11
> 21632 barney 0.1 0.1 99.6 2007-11-01 09:24:09
> 206 fred 19.4 4.8 53.7 2007-10-18 06:50:34
> 18151 barney 0.1 0.2 94.9 2007-10-27 13:22:09
> 10662 fred 0.1 0.2 99.6 2007-11-01 19:22:27
> 10376 fred 0.0 0.2 99.7 2007-11-01 09:50:24
> 3630 fred 43.7 7.0 33.0 2007-10-23 00:58:27
> 1118 fred 0.6 0.4 98.9 2007-10-19 13:14:23
> 5122 fred 0.1 0.2 99.6 2007-10-25 02:42:21
> 22117 barney 0.0 0.2 99.4 2007-11-02 01:34:04
>
> > doit(ff2)
> hostName hour user.mean sys.mean idle.mean user.max sys.max idle.max
> 1 barney 01 0.00 0.20 99.40 0.0 0.2 99.4
> 2 barney 09 0.10 0.10 99.60 0.1 0.1 99.6
> 3 barney 10 0.10 0.20 99.40 0.1 0.2 99.4
> 4 barney 13 0.10 0.20 94.90 0.1 0.2 94.9
> 5 barney 14 0.10 0.30 88.75 0.1 0.4 93.2
> 6 barney 15 1.20 2.00 92.60 1.2 2.0 92.6
> 7 barney 19 0.50 0.20 94.60 0.5 0.2 94.6
> 8 fred 00 43.70 7.00 33.00 43.7 7.0 33.0
> 9 fred 02 0.25 0.35 98.80 0.4 0.5 99.6
> 10 fred 05 0.00 0.10 99.80 0.0 0.1 99.8
> 11 fred 06 19.40 4.80 53.70 19.4 4.8 53.7
> 12 fred 07 0.10 0.20 99.70 0.1 0.2 99.7
> 13 fred 08 0.00 0.20 99.80 0.0 0.2 99.8
> 14 fred 09 0.00 0.20 99.70 0.0 0.2 99.7
> 15 fred 12 0.20 0.30 99.50 0.2 0.3 99.5
> 16 fred 13 0.60 0.40 98.90 0.6 0.4 98.9
> 17 fred 19 0.10 0.20 99.50 0.1 0.2 99.6
> > doit
> function(x){
> x.mean<-aggregate(x[,c("user","sys","idle")],
> by=list(hostName=x$hostName,
>
> hour=strftime(as.POSIXlt(x$obsTime),"%H")),
> mean)
>
> x.max<-aggregate(x[,c("user","sys","idle")],
> by=list(hostName=x$hostName,
>
> hour=strftime(as.POSIXlt(x$obsTime),"%H")),
> max)
>
> t1<-merge(x.mean,x.max,by=c("hostName","hour"),suffixes=c(".mean",".max"))
> return(t1)
> }
>
> The point of the "doit" function is to make a new dataframe in which
> the columns are summary statistics of certain columns in the argument.
>
> Is there a function similar to:
>
> magic.function(ff2[,c("user","system","idle")],
> by=list(hostName=ff2$hostName,hour=strftime(as.POSIXlt(ff2$obsTime),"%H")),
> function(x){c(mean.user=mean(x$user),
> mean.system=mean(x$system),
> mean.idle=mean(x$idle),
> max.user=max(x$user),
> max.system=max(x$system),
> max.idle=max(x$idle))})
>
> ie. an "aggregate" that can cope with a non-scalar function and "do
> what I mean"? My doit function gets more and more ugly the more
> summary statistics I add, and I worry about the "merge" with hundreds
> of thousands of rows.
>
> I'm almost sure I've seen a solution to what I know is a simple
> problem, but I guess my search skills are as bad as my "R": I've
> rummaged around the r-help archives and came up with nothing to show
> for it.
>
>
> Pointers would be gratefully received.
>
> Many thanks.
> --
> Regards,
>
> Mike Nielsen
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
--
Jim Holtman
Cincinnati, OH
+1 513 646 9390
What is the problem you are trying to solve?
More information about the R-help
mailing list