[R] best way to apply a list of functions to a dataset ?
Peter Ehlers
ehlers at ucalgary.ca
Wed Jul 21 16:21:30 CEST 2010
Dennis' ddply solution would be my choice. Here is
a small variation that makes it easy to modify what
list of functions is applied:
#----
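library(plyr)     # ddply(), summarise()
library(reshape)  # melt(); reshape2 also provides it
# skewness() and mean.gt.med() are as defined earlier in the thread
ma <- melt(attitude)
f <- function(x, v) summarise(x,
       mean        = mean(v),
       sd          = sd(v),
       skewness    = skewness(v),
       mean.gt.med = mean.gt.med(v)
     )
ddply(ma, .(variable), function(x) f(x, v = x[["value"]]))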
#----
Another option is to use data.frame in place of summarise:
#----
f <- function(x,v) data.frame(
mean = mean(v),
sd = sd(v),
skewness = skewness(v),
mean.gt.med = mean.gt.med(v)
)
#----
-Peter Ehlers
On 2010-07-21 0:41, Dennis Murphy wrote:
> Hi:
>
> On Tue, Jul 20, 2010 at 5:37 PM, Glen Barnett<glnbrntt at gmail.com> wrote:
>
>> Hi Dennis,
>>
>> Thanks for the reply.
>>
>> Yes, that's easier, but the conversion to a matrix with rbind has
>> converted the output of that final function to a numeric.
>>
>
> If you look at the output of lapply(attitude, f), you'll see that the
> conversion from logical to numeric has already taken place. Different
> components of lists can have different types, but within a component, all of
> the elements must have the same class.
> You can patch up the result as follows:
>
> x<- data.frame(do.call(rbind, lapply(attitude, f)))
> x[, 5]<- as.logical(x[, 5])
> > x
>                mean        sd    skewness median mean.gt.med
> rating     64.63333 12.172562 -0.35792491   65.5       FALSE
> complaints 66.60000 13.314757 -0.21541749   65.0        TRUE
> privileges 53.13333 12.235430  0.37912287   51.5        TRUE
> learning   56.36667 11.737013 -0.05403354   56.5       FALSE
> raises     64.63333 10.397226  0.19754317   63.5        TRUE
> critical   74.76667  9.894908 -0.86577893   77.5       FALSE
> advance    42.93333 10.288706  0.85039799   41.0        TRUE
>
> but if you're doing this sort of thing over a large data frame such fixes
> may be impractical.
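
A minimal sketch of the coercion described above, not taken from the thread itself. It uses only base R and the attitude data; mean.gt.med is redefined here so the block runs on its own. c() builds an atomic vector and therefore promotes the logical to numeric, while list() and one-row data frames keep each component's type.

```r
mean.gt.med <- function(x) mean(x) > median(x)
x <- attitude$complaints

# c() returns an atomic vector, so TRUE is coerced to 1
v <- c(mean = mean(x), mean.gt.med = mean.gt.med(x))
class(v)          # "numeric"

# list() keeps each component's own type
l <- list(mean = mean(x), mean.gt.med = mean.gt.med(x))
sapply(l, class)  # mean: "numeric", mean.gt.med: "logical"

# One-row data frames bound together keep column types without a patch-up step
g <- function(x) data.frame(mean = mean(x), mean.gt.med = mean.gt.med(x))
d <- do.call(rbind, lapply(attitude, g))
```

Returning a one-row data frame per variable is slower than returning a vector, but it avoids fixing column types by hand afterwards.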
>
>
>> I included that last function in the example specifically to preclude
>> people assuming that functions would always return the same type.
>>
>
>
> There is a plyr solution, although it's a little more long-winded than I'd
> prefer in the end:
>
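> library(ggplot2)  # attached plyr and reshape at the time; with current
>                   # versions, load plyr and reshape (or reshape2) yourself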
> # melt the data frame so that the variables become factor levels
> ma<- melt(attitude)
> Using as id variables
> > dim(ma)
> [1] 210 2
> # Use ddply to get the set of summaries by variable:
>
> ddply(ma, .(variable), summarise, mean = mean(value), sd = sd(value),
> skewness = skewness(value), median = median(value),
> mean.gt.med = mean.gt.med(value))
>     variable     mean        sd    skewness median mean.gt.med
> 1     rating 64.63333 12.172562 -0.35792491   65.5       FALSE
> 2 complaints 66.60000 13.314757 -0.21541749   65.0        TRUE
> 3 privileges 53.13333 12.235430  0.37912287   51.5        TRUE
> 4   learning 56.36667 11.737013 -0.05403354   56.5       FALSE
> 5     raises 64.63333 10.397226  0.19754317   63.5        TRUE
> 6   critical 74.76667  9.894908 -0.86577893   77.5       FALSE
> 7    advance 42.93333 10.288706  0.85039799   41.0        TRUE
>
> Notice that now the logical class of mean.gt.med is preserved. The trick
> with ddply() in package plyr is that, in this case, it is convenient to melt
> the data frame first before doing the summarizations. This is because
> ddply() requires a grouping variable - in this example, the groups are the
> variables themselves.
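
To illustrate the reshaping step without extra packages, here is a rough base-R analogue. It is a swapped-in technique, not the melt() call used above: stack() also turns the seven columns into one long column plus a factor, but names the results values and ind rather than value and variable.

```r
# Long format: one row per observation, with a factor naming the source column
long <- stack(attitude)
dim(long)        # 210 2, matching dim(ma) above
names(long)      # "values" "ind"

# The factor now serves as the grouping variable for per-variable summaries
means <- sapply(split(long$values, long$ind), mean)
means
```

Once the variable names are factor levels, any split-apply-combine tool can treat each original column as a group.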
>
> HTH,
> Dennis
>
>> I guess this doesn't matter too much for a logical, but what if
>> instead the function returned a character (say "mean", "median", or
>> "equal" - indicating which one was larger, or "equal" which could
>> easily happen with discrete data). This precludes using rbind (which I
>> also used at first, before I noticed that sometimes I could have
>> functions that don't return numerics).
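
A sketch of the character case raised here, under stated assumptions: which.larger is a hypothetical function written for illustration and does not appear in the thread. Binding vectors with rbind turns every cell into character, whereas binding one-row data frames keeps the numeric column numeric.

```r
# Hypothetical summary function returning a character value
which.larger <- function(x) {
  m  <- mean(x)
  md <- median(x)
  if (m > md) "mean" else if (m < md) "median" else "equal"
}

# Vector-based version: c() coerces the mean to character as well
f <- function(x) c(mean = mean(x), larger = which.larger(x))
m <- do.call(rbind, lapply(attitude, f))
typeof(m)   # "character"

# Data-frame-based version: each column keeps its own type
g <- function(x) data.frame(mean = mean(x), larger = which.larger(x),
                            stringsAsFactors = FALSE)
d <- do.call(rbind, lapply(attitude, g))
sapply(d, class)
```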
>>
>> Glen
>>
>>
>> On Tue, Jul 20, 2010 at 6:55 PM, Dennis Murphy<djmuser at gmail.com> wrote:
>>> Hi:
>>>
>>> This might be a little easier (?):
>>>
>>> library(datasets)
>>> skewness<- function(x) mean(scale(x)^3)
>>> mean.gt.med<- function(x) mean(x)>median(x)
>>>
>>> # ------
>>> # construct the function to apply to each variable in the data frame
>>> f<- function(x) c(mean = mean(x), sd = sd(x), skewness = skewness(x),
>>> median = median(x), mean.gt.med = mean.gt.med(x))
>>>
>>> # map function to each variable with lapply and combine with do.call():
>>> do.call(rbind, lapply(attitude, f))
>>>                mean        sd    skewness median mean.gt.med
>>> rating     64.63333 12.172562 -0.35792491   65.5           0
>>> complaints 66.60000 13.314757 -0.21541749   65.0           1
>>> privileges 53.13333 12.235430  0.37912287   51.5           1
>>> learning   56.36667 11.737013 -0.05403354   56.5           0
>>> raises     64.63333 10.397226  0.19754317   63.5           1
>>> critical   74.76667  9.894908 -0.86577893   77.5           0
>>> advance    42.93333 10.288706  0.85039799   41.0           1
>>>
>>> HTH,
>>> Dennis
>>>
>>>
>>> On Mon, Jul 19, 2010 at 10:51 PM, Glen Barnett<glnbrntt at gmail.com> wrote:
>>>>
>>>> Assuming I have a matrix of data (or under some restrictions that will
>>>> become obvious, possibly a data frame), I want to be able to apply a
>>>> list of functions (initially producing a single number from a vector)
>>>> to the data and produce a data frame (for compact output) with column
>>>> 1 being the function results for the first function, column 2 being
>>>> the results for the second function and so on - with each row being
>>>> the columns of the original data.
>>>>
>>>> The obvious application of this is to produce summaries of data sets
>>>> (a bit like summary() does on numeric matrices), but with user
>>>> supplied functions. I am content for the moment to leave it to the
>>>> user to supply functions that work with the data they supply so as to
>>>> produce results that will actually be data-frame-able, though I'd like
>>>> to ultimately make it a bit nicer than it currently is without
>>>> compromising the niceness of the output in the "good" cases.
>>>>
>>>> The example below is a simplistic approach to this problem (it should
>>>> run as is). I have named it "fapply" for fairly obvious reasons, but
>>>> added the ".1" because it doesn't accept multidimensional arrays. I
>>>> have included the output I generated, which is what I want. There are
>>>> some obvious generalizations (e.g. being able to include functions
>>>> like range(), say, that produce several values on a vector, rather
>>>> than one, making the user's life simpler when a function already does
>>>> most of what they need).
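
One possible way to handle the generalization mentioned here, sketched with base R only. The names funs, per_fun and res are chosen for illustration. When a function such as range() returns several values per column, sapply() gives a matrix, which is transposed so that rows still correspond to the original variables.

```r
funs <- list(mean = mean, range = range)

# For each function, compute results per column of the data;
# multi-valued results come back as a matrix and are transposed
per_fun <- lapply(funs, function(f) {
  r <- sapply(attitude, f)
  if (is.matrix(r)) t(r) else r
})

# data.frame() expands the two-column matrix into two columns
res <- data.frame(per_fun)
res
```

This sketch assumes every function returns the same number of values for each column; a function whose result length varies would make sapply() return a list and needs separate handling.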
>>>>
>>>> My concern: growing a list inside a for loop, as this does, looks like a
>>>> silly approach. Also I recall reading that if you find yourself
>>>> using "do.call" you should probably be doing something else.
>>>>
>>>> So my question: Is there a better way to implement a function like this?
>>>>
>>>> Or, even better, is there already a function that does this?
>>>>
>>>> ## example function and code to apply a list of functions to a matrix
>>>> ## (here a numeric data frame)
>>>>
>>>> library(datasets)
>>>>
>>>> fapply.1<- function(x, fun.l, colnames=fun.l){
>>>> out.l<- list() # starts with an empty list
>>>> # loop through list of functions
>>>> for (i in seq_along(fun.l)) out.l[[i]]<- apply(x,2,fun.l[[i]])
>>>>
>>>> # set up names and make into a data frame
>>>> names(out.l)<- colnames
>>>> attr(out.l,"row.names")<- names(out.l[[1]])
>>>> attr(out.l,"class")<- "data.frame"
>>>> out.l
>>>> }
>>>>
>>>> skewness<- function(x) mean(scale(x)^3) # define a simple numeric function
>>>> mean.gt.med<- function(x) mean(x)>median(x) # define a simple non-numeric fn
>>>> flist<- c("mean","sd","skewness","median","mean.gt.med") # make list of fns to apply
>>>>
>>>> fapply.1(attitude,flist)
>>>>                mean        sd    skewness median mean.gt.med
>>>> rating     64.63333 12.172562 -0.35792491   65.5       FALSE
>>>> complaints 66.60000 13.314757 -0.21541749   65.0        TRUE
>>>> privileges 53.13333 12.235430  0.37912287   51.5        TRUE
>>>> learning   56.36667 11.737013 -0.05403354   56.5       FALSE
>>>> raises     64.63333 10.397226  0.19754317   63.5        TRUE
>>>> critical   74.76667  9.894908 -0.86577893   77.5       FALSE
>>>> advance    42.93333 10.288706  0.85039799   41.0        TRUE
>>>>
>>>> ## end code and output
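
A hedged sketch of how the same result could be built without growing a list in a loop or setting the data frame attributes by hand. fapply.2 is a hypothetical name, and it takes a named list of functions instead of a character vector, which is an assumption made here to avoid name lookup. This is one possible rewrite, not a tested replacement for fapply.1.

```r
skewness    <- function(x) mean(scale(x)^3)
mean.gt.med <- function(x) mean(x) > median(x)

# Hypothetical rewrite: lapply() over the functions, sapply() over the columns
fapply.2 <- function(x, fun.l, colnames = names(fun.l)) {
  x <- as.data.frame(x)   # accept a matrix or a data frame
  out <- lapply(fun.l, function(f) sapply(x, f))
  names(out) <- colnames
  # data.frame() checks lengths and builds row names and class for us
  data.frame(out, row.names = names(x), check.names = FALSE)
}

flist <- list(mean = mean, sd = sd, skewness = skewness,
              median = median, mean.gt.med = mean.gt.med)
res <- fapply.2(attitude, flist)
res
```

Because each function's results stay in their own list component until data.frame() is called, the logical column is not coerced to numeric.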
>>>>
>>>> So did I miss something obvious?
>>>>
>>>> Any suggestions as far as style or simple stability-enhancing
>>>> improvements would be handy.
>>>>
>>>> regards,
>>>> Glen