[R] How to calculate means for multiple variables in samples with different sizes

Matthew Dowle mdowle at mdowle.plus.com
Fri Mar 11 13:53:42 CET 2011


Hi,

One-liners in data.table are:

> x.dt[,lapply(.SD,mean),by=sample]
     sample replicate   height    weight      age
[1,]      A       2.0 12.20000 0.5033333 6.000000
[2,]      B       1.5 12.75000 0.7150000 4.500000
[3,]      C       2.5 11.35250 0.5125000 3.750000
[4,]      D       2.0 14.99333 0.6733333 5.333333
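
For anyone following along, x.dt is not defined anywhere in this thread. The sketch below is one way to build it, assuming it is simply Aline's example data typed in by hand as a data.table (the construction is my assumption, not something she posted), followed by the same one-liner:

```r
library(data.table)

# Aline's example data, typed in by hand from her post (assumed layout)
x.dt <- data.table(
  sample    = c("A", "A", "A", "B", "B", "C", "C", "C", "C", "D", "D", "D"),
  replicate = c(1, 2, 3, 1, 2, 1, 2, 3, 4, 1, 2, 3),
  height    = c(12.0, 12.2, 12.4, 12.7, 12.8, 11.9, 11.84, 11.43, 10.24,
                14.2, 15.67, 15.11),
  weight    = c(0.64, 0.38, 0.49, 0.65, 0.78, 0.45, 0.44, 0.32, 0.84,
                0.54, 0.67, 0.81),
  age       = c(6, 6, 6, 4, 5, 6, 2, 3, 4, 2, 7, 7)
)

# Mean of every non-grouping column within each sample.
# Groups have 3, 2, 4 and 3 rows; mean() handles the unequal sizes itself.
res <- x.dt[, lapply(.SD, mean), by = sample]
print(res)
```

Note that .SD holds every column except the grouping one, which is why the (not very meaningful) mean of replicate shows up too.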

without the replicate column :

> x.dt[,lapply(list(height,weight,age),mean),by=sample]
     sample       V1        V2       V3
[1,]      A 12.20000 0.5033333 6.000000
[2,]      B 12.75000 0.7150000 4.500000
[3,]      C 11.35250 0.5125000 3.750000
[4,]      D 14.99333 0.6733333 5.333333

one (long) way to retain the column names :

> x.dt[,lapply(list(height=height,weight=weight,age=age),mean),by=sample]
     sample   height    weight      age
[1,]      A 12.20000 0.5033333 6.000000
[2,]      B 12.75000 0.7150000 4.500000
[3,]      C 11.35250 0.5125000 3.750000
[4,]      D 14.99333 0.6733333 5.333333
>

or this is shorter :

> ans = x.dt[,lapply(.SD,mean),by=sample]
> ans$replicate = NULL
> ans
     sample   height    weight      age
[1,]      A 12.20000 0.5033333 6.000000
[2,]      B 12.75000 0.7150000 4.500000
[3,]      C 11.35250 0.5125000 3.750000
[4,]      D 14.99333 0.6733333 5.333333
>

or another way :

> mycols = c("height","weight","age")
> x.dt[,lapply(.SD[,mycols,with=FALSE],mean),by=sample]
     sample   height    weight      age
[1,]      A 12.20000 0.5033333 6.000000
[2,]      B 12.75000 0.7150000 4.500000
[3,]      C 11.35250 0.5125000 3.750000
[4,]      D 14.99333 0.6733333 5.333333
>

or another way :

> x.dt[,lapply(.SD[,list(height,weight,age)],mean),by=sample]
     sample   height    weight      age
[1,]      A 12.20000 0.5033333 6.000000
[2,]      B 12.75000 0.7150000 4.500000
[3,]      C 11.35250 0.5125000 3.750000
[4,]      D 14.99333 0.6733333 5.333333
>
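
With 600 conditions, typing the column names is not practical. A minimal sketch of one alternative, assuming your installed data.table supports the .SDcols argument (current releases do) and that the columns to skip are called sample and replicate as in the example data, which is rebuilt here by hand:

```r
library(data.table)

# Example data rebuilt by hand from the original post (assumed layout)
x.dt <- data.table(
  sample    = c("A", "A", "A", "B", "B", "C", "C", "C", "C", "D", "D", "D"),
  replicate = c(1, 2, 3, 1, 2, 1, 2, 3, 4, 1, 2, 3),
  height    = c(12.0, 12.2, 12.4, 12.7, 12.8, 11.9, 11.84, 11.43, 10.24,
                14.2, 15.67, 15.11),
  weight    = c(0.64, 0.38, 0.49, 0.65, 0.78, 0.45, 0.44, 0.32, 0.84,
                0.54, 0.67, 0.81),
  age       = c(6, 6, 6, 4, 5, 6, 2, 3, 4, 2, 7, 7)
)

# Every column except the grouping column and the replicate index
mycols <- setdiff(names(x.dt), c("sample", "replicate"))

# .SDcols restricts .SD to the named columns, so column names are kept
# and replicate never enters the result
res <- x.dt[, lapply(.SD, mean), by = sample, .SDcols = mycols]
print(res)
```

Computing mycols with setdiff() means the call does not change however many condition columns there are.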

The way Jim showed :

> x.dt[, list(height = mean(height)
+            , weight = mean(weight)
+            , age = mean(age)
+            ), by = sample]

is the more flexible syntax: it makes it easy to apply different functions 
to different columns and, as a bonus, it is fast.
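
To illustrate the "different functions on different columns" point, here is a small sketch. The choice of median() and max() is purely for illustration and is not something anyone in the thread asked for; the data is again rebuilt by hand from the original post:

```r
library(data.table)

# Example data rebuilt by hand from the original post (assumed layout)
x.dt <- data.table(
  sample    = c("A", "A", "A", "B", "B", "C", "C", "C", "C", "D", "D", "D"),
  replicate = c(1, 2, 3, 1, 2, 1, 2, 3, 4, 1, 2, 3),
  height    = c(12.0, 12.2, 12.4, 12.7, 12.8, 11.9, 11.84, 11.43, 10.24,
                14.2, 15.67, 15.11),
  weight    = c(0.64, 0.38, 0.49, 0.65, 0.78, 0.45, 0.44, 0.32, 0.84,
                0.54, 0.67, 0.81),
  age       = c(6, 6, 6, 4, 5, 6, 2, 3, 4, 2, 7, 7)
)

# One named expression per output column, each with its own function
res <- x.dt[, list(height = mean(height),     # average height
                   weight = median(weight),   # middle weight
                   age    = max(age),         # oldest replicate
                   n      = length(height)),  # replicates per sample
            by = sample]
print(res)
```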

Matthew


"Dennis Murphy" <djmuser at gmail.com> wrote in message 
news:AANLkTimxXL8BqTaYKUb=sAEE2CrA9fOSfuAp4QZkX8fe at mail.gmail.com...
> Hi:
>
> Here are a few one-liners. Calling your data frame dd,
>
> aggregate(cbind(height, weight, age) ~ sample, data = dd, FUN = mean)
>  sample   height    weight      age
> 1      A 12.20000 0.5033333 6.000000
> 2      B 12.75000 0.7150000 4.500000
> 3      C 11.35250 0.5125000 3.750000
> 4      D 14.99333 0.6733333 5.333333
>
> With package doBy:
>
> library(doBy)
> summaryBy(height + weight + age ~ sample, data = dd, FUN = mean)
>  sample height.mean weight.mean age.mean
> 1      A    12.20000   0.5033333 6.000000
> 2      B    12.75000   0.7150000 4.500000
> 3      C    11.35250   0.5125000 3.750000
> 4      D    14.99333   0.6733333 5.333333
>
> With package plyr:
>
> library(plyr)
> ddply(dd, .(sample), colwise(mean, .(height, weight, age)))
>  sample   height    weight      age
> 1      A 12.20000 0.5033333 6.000000
> 2      B 12.75000 0.7150000 4.500000
> 3      C 11.35250 0.5125000 3.750000
> 4      D 14.99333 0.6733333 5.333333
>
> Dennis
>
> On Fri, Mar 11, 2011 at 1:32 AM, Aline Santos <alinexss at gmail.com> wrote:
>
>> Hello R-helpers:
>>
>> I have data like this:
>>
>> sample    replicate    height    weight    age
>> A    1.00    12.0    0.64    6.00
>> A    2.00    12.2    0.38    6.00
>> A    3.00    12.4    0.49    6.00
>> B    1.00    12.7    0.65    4.00
>> B    2.00    12.8    0.78    5.00
>> C    1.00    11.9    0.45    6.00
>> C    2.00    11.84    0.44    2.00
>> C    3.00    11.43    0.32    3.00
>> C    4.00    10.24    0.84    4.00
>> D    1.00    14.2    0.54    2.00
>> D    2.00    15.67    0.67    7.00
>> D    3.00    15.11    0.81    7.00
>>
>> Now, how can I calculate the mean of each condition (height, weight, age)
>> in each sample, given that the samples have different numbers of
>> replicates?
>>
>>
>> The final matrix should look like:
>>
>> sample    height    weight    age
>> A         12.20     0.50      6.00
>> B         12.75     0.72      4.50
>> C         11.35     0.51      3.75
>> D         14.99     0.67      5.33
>>
>> This is a simplified version of my dataset, which consists of 100 samples
>> (unequally distributed in 530 replicates) for 600 different conditions.
>>
>> I appreciate all the help.
>>
>> A.S.
>>


