[R] Aggregate with non-scalar function

jim holtman jholtman at gmail.com
Wed Nov 7 21:35:16 CET 2007


Is this closer to what you would like?

> x <- textConnection("     hostName user sys idle             date time
+ 10142     fred  0.4 0.5 98.0 2007-11-01 02:02:18
+ 16886   barney  0.5 0.2 94.6 2007-10-25 19:12:12
+ 8795      fred  0.0 0.1 99.8 2007-10-30 05:08:22
+ 5261      fred  0.1 0.2 99.7 2007-10-25 07:20:32
+ 12427   barney  0.1 0.2 93.2 2007-10-19 14:34:10
+ 18067   barney  0.1 0.2 99.4 2007-10-27 10:34:08
+ 973       fred  0.0 0.2 99.8 2007-10-19 08:24:22
+ 5426      fred  0.2 0.3 99.5 2007-10-25 12:50:33
+ 7067      fred  0.1 0.2 99.4 2007-10-27 19:32:27
+ 13159   barney  0.1 0.4 84.3 2007-10-20 14:58:11
+ 17481   barney  1.2 2.0 92.6 2007-10-26 15:02:11
+ 21632   barney  0.1 0.1 99.6 2007-11-01 09:24:09
+ 206       fred 19.4 4.8 53.7 2007-10-18 06:50:34
+ 18151   barney  0.1 0.2 94.9 2007-10-27 13:22:09
+ 10662     fred  0.1 0.2 99.6 2007-11-01 19:22:27
+ 10376     fred  0.0 0.2 99.7 2007-11-01 09:50:24
+ 3630      fred 43.7 7.0 33.0 2007-10-23 00:58:27
+ 1118      fred  0.6 0.4 98.9 2007-10-19 13:14:23
+ 5122      fred  0.1 0.2 99.6 2007-10-25 02:42:21
+ 22117   barney  0.0 0.2 99.4 2007-11-02 01:34:04")
> x.in <- read.table(x, header=TRUE, as.is=TRUE)
> x.in$hour <- sapply(strsplit(x.in$time, ":"), '[', 1) # pick off the hour
> x.by <- by(x.in, list(x.in$hour, x.in$hostName), function(.host){
+     data.frame(hostName=.host$hostName[1], hour=.host$hour[1],
+       user.mean=mean(.host$user),
+       sys.mean=mean(.host$sys),
+       idle.mean=mean(.host$idle),
+       user.max=max(.host$user),
+       sys.max=max(.host$sys),
+       idle.max=max(.host$idle))
+ })
> do.call('rbind', x.by)
   hostName hour user.mean sys.mean idle.mean user.max sys.max idle.max
1    barney   01      0.00     0.20     99.40      0.0     0.2     99.4
2    barney   09      0.10     0.10     99.60      0.1     0.1     99.6
3    barney   10      0.10     0.20     99.40      0.1     0.2     99.4
4    barney   13      0.10     0.20     94.90      0.1     0.2     94.9
5    barney   14      0.10     0.30     88.75      0.1     0.4     93.2
6    barney   15      1.20     2.00     92.60      1.2     2.0     92.6
7    barney   19      0.50     0.20     94.60      0.5     0.2     94.6
8      fred   00     43.70     7.00     33.00     43.7     7.0     33.0
9      fred   02      0.25     0.35     98.80      0.4     0.5     99.6
10     fred   05      0.00     0.10     99.80      0.0     0.1     99.8
11     fred   06     19.40     4.80     53.70     19.4     4.8     53.7
12     fred   07      0.10     0.20     99.70      0.1     0.2     99.7
13     fred   08      0.00     0.20     99.80      0.0     0.2     99.8
14     fred   09      0.00     0.20     99.70      0.0     0.2     99.7
15     fred   12      0.20     0.30     99.50      0.2     0.3     99.5
16     fred   13      0.60     0.40     98.90      0.6     0.4     98.9
17     fred   19      0.10     0.20     99.50      0.1     0.2     99.6
>


On 11/7/07, Mike Nielsen <mr.blacksheep at gmail.com> wrote:
> R-Helpers,
>
> I'm sorry to have to ask this -- I've not used R very much in the last
> 8 or 10 months, and I've gotten rusty.
>
> I have the following (ff2 is a subset of a much, much larger dataset):
>
> > ff2
>      hostName user sys idle             obsTime
> 10142     fred  0.4 0.5 98.0 2007-11-01 02:02:18
> 16886   barney  0.5 0.2 94.6 2007-10-25 19:12:12
> 8795      fred  0.0 0.1 99.8 2007-10-30 05:08:22
> 5261      fred  0.1 0.2 99.7 2007-10-25 07:20:32
> 12427   barney  0.1 0.2 93.2 2007-10-19 14:34:10
> 18067   barney  0.1 0.2 99.4 2007-10-27 10:34:08
> 973       fred  0.0 0.2 99.8 2007-10-19 08:24:22
> 5426      fred  0.2 0.3 99.5 2007-10-25 12:50:33
> 7067      fred  0.1 0.2 99.4 2007-10-27 19:32:27
> 13159   barney  0.1 0.4 84.3 2007-10-20 14:58:11
> 17481   barney  1.2 2.0 92.6 2007-10-26 15:02:11
> 21632   barney  0.1 0.1 99.6 2007-11-01 09:24:09
> 206       fred 19.4 4.8 53.7 2007-10-18 06:50:34
> 18151   barney  0.1 0.2 94.9 2007-10-27 13:22:09
> 10662     fred  0.1 0.2 99.6 2007-11-01 19:22:27
> 10376     fred  0.0 0.2 99.7 2007-11-01 09:50:24
> 3630      fred 43.7 7.0 33.0 2007-10-23 00:58:27
> 1118      fred  0.6 0.4 98.9 2007-10-19 13:14:23
> 5122      fred  0.1 0.2 99.6 2007-10-25 02:42:21
> 22117   barney  0.0 0.2 99.4 2007-11-02 01:34:04
>
> > doit(ff2)
>   hostName hour user.mean sys.mean idle.mean user.max sys.max idle.max
> 1    barney   01      0.00     0.20     99.40      0.0     0.2     99.4
> 2    barney   09      0.10     0.10     99.60      0.1     0.1     99.6
> 3    barney   10      0.10     0.20     99.40      0.1     0.2     99.4
> 4    barney   13      0.10     0.20     94.90      0.1     0.2     94.9
> 5    barney   14      0.10     0.30     88.75      0.1     0.4     93.2
> 6    barney   15      1.20     2.00     92.60      1.2     2.0     92.6
> 7    barney   19      0.50     0.20     94.60      0.5     0.2     94.6
> 8      fred   00     43.70     7.00     33.00     43.7     7.0     33.0
> 9      fred   02      0.25     0.35     98.80      0.4     0.5     99.6
> 10     fred   05      0.00     0.10     99.80      0.0     0.1     99.8
> 11     fred   06     19.40     4.80     53.70     19.4     4.8     53.7
> 12     fred   07      0.10     0.20     99.70      0.1     0.2     99.7
> 13     fred   08      0.00     0.20     99.80      0.0     0.2     99.8
> 14     fred   09      0.00     0.20     99.70      0.0     0.2     99.7
> 15     fred   12      0.20     0.30     99.50      0.2     0.3     99.5
> 16     fred   13      0.60     0.40     98.90      0.6     0.4     98.9
> 17     fred   19      0.10     0.20     99.50      0.1     0.2     99.6
> > doit
> function(x){
> x.mean<-aggregate(x[,c("user","sys","idle")],
>                             by=list(hostName=x$hostName,
>
> hour=strftime(as.POSIXlt(x$obsTime),"%H")),
>                             mean)
>
> x.max<-aggregate(x[,c("user","sys","idle")],
>                           by=list(hostName=x$hostName,
>
> hour=strftime(as.POSIXlt(x$obsTime),"%H")),
>                           max)
>
> t1<-merge(x.mean,x.max,by=c("hostName","hour"),suffixes=c(".mean",".max"))
> return(t1)
> }
>
> The point of the "doit" function is to make a new dataframe in which
> the columns are summary statistics of certain columns in the argument.
>
> Is there a function similar to:
>
> magic.function(ff2[,c("user","system","idle")],
>      by=list(hostName=ff2$hostName,hour=strftime(as.POSIXlt(ff2$obsTime),"%H")),
>      function(x){c(mean.user=mean(x$user),
>                        mean.system=mean(x$system),
>                        mean.idle=mean(x$idle),
>                        max.user=max(x$user),
>                        max.system=max(x$system),
>                        max.idle=max(x$idle))})
>
> ie. an "aggregate" that can cope with a non-scalar function and "do
> what I mean"?  My doit function gets more and more ugly the more
> summary statistics I add, and I worry about the "merge" with hundreds
> of thousands of rows.
>
> I'm almost sure I've seen a solution to what I know is a simple
> problem, but I guess my search skills are as bad as my "R": I've
> rummaged around the r-help archives and came up with nothing to show
> for it.
>
>
> Pointers would be gratefully received.
>
> Many thanks.
> --
> Regards,
>
> Mike Nielsen
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>


-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem you are trying to solve?



More information about the R-help mailing list