[R] best way to apply a list of functions to a dataset ?
David Winsemius
dwinsemius at comcast.net
Wed Jul 21 03:10:03 CEST 2010
On Jul 20, 2010, at 8:37 PM, Glen Barnett wrote:
> Hi Dennis,
>
> Thanks for the reply.
>
> Yes, that's easier, but the conversion to a matrix with rbind has
> converted the output of that final function to a numeric.
>
> I included that last function in the example specifically to preclude
> people assuming that functions would always return the same type.
>
> I guess this doesn't matter too much for a logical, but what if
> instead the function returned a character (say "mean", "median", or
> "equal", indicating which one was larger, where "equal" could
> easily happen with discrete data)? This precludes using rbind (which I
> also used at first, before I noticed that sometimes I could have
> functions that don't return numerics).
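The conversion described here happens in c() inside f, before rbind sees anything: c() builds an atomic vector, so mixed results are coerced to one common type. A small illustration with made-up values (the names x, y and z are only for this example):

```r
# c() returns an atomic vector, so a logical mixed with numerics is coerced
x <- c(mean = 64.6, mean.gt.med = FALSE)
class(x)    # "numeric": FALSE is stored as 0

# with a character result, everything becomes character
y <- c(mean = 64.6, larger = "median")
class(y)    # "character"

# a list keeps each element's own type
z <- list(mean = 64.6, mean.gt.med = FALSE, larger = "median")
sapply(z, class)
```

So any approach that stacks one atomic vector per variable will lose the types; keeping one vector per function avoids that.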
Have you considered the summaryBy function in package doBy? (Why
reinvent the wheel?)
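
If an extra package is not an option, the mixed-type problem can also be sidestepped in base R by building one column per function rather than one row per variable. This is a minimal sketch, not how summaryBy works; fapply.2 and which.larger are hypothetical names introduced here for illustration, and each supplied function is assumed to return a single value:

```r
library(datasets)

skewness    <- function(x) mean(scale(x)^3)
mean.gt.med <- function(x) mean(x) > median(x)
# hypothetical character-valued summary of the kind described above
which.larger <- function(x) {
  m <- mean(x); md <- median(x)
  if (m > md) "mean" else if (m < md) "median" else "equal"
}

# fapply.2 (hypothetical name): funs is a named list of functions;
# every function becomes one column, so each column keeps its own type
fapply.2 <- function(x, funs) {
  x <- as.data.frame(x)                          # also accepts a matrix
  cols <- lapply(funs, function(f) sapply(x, f)) # one vector per function
  data.frame(cols, row.names = names(x), stringsAsFactors = FALSE)
}

flist <- list(mean = mean, sd = sd, skewness = skewness, median = median,
              mean.gt.med = mean.gt.med, which.larger = which.larger)
res <- fapply.2(attitude, flist)
sapply(res, class)   # column types after combining
```

Passing the functions themselves in a named list, rather than their names as strings, avoids any lookup by name and lets lapply replace the for loop.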
--
David.
>
> Glen
>
>
> On Tue, Jul 20, 2010 at 6:55 PM, Dennis Murphy <djmuser at gmail.com>
> wrote:
>> Hi:
>>
>> This might be a little easier (?):
>>
>> library(datasets)
>> skewness <- function(x) mean(scale(x)^3)
>> mean.gt.med <- function(x) mean(x)>median(x)
>>
>> # ------
>> # construct the function to apply to each variable in the data frame
>> f <- function(x) c(mean = mean(x), sd = sd(x), skewness = skewness(x),
>>                    median = median(x), mean.gt.med = mean.gt.med(x))
>>
>> # map function to each variable with lapply and combine with do.call():
>> do.call(rbind, lapply(attitude, f))
>>                mean        sd    skewness median mean.gt.med
>> rating     64.63333 12.172562 -0.35792491   65.5           0
>> complaints 66.60000 13.314757 -0.21541749   65.0           1
>> privileges 53.13333 12.235430  0.37912287   51.5           1
>> learning   56.36667 11.737013 -0.05403354   56.5           0
>> raises     64.63333 10.397226  0.19754317   63.5           1
>> critical   74.76667  9.894908 -0.86577893   77.5           0
>> advance    42.93333 10.288706  0.85039799   41.0           1
>>
>> HTH,
>> Dennis
>>
>>
>> On Mon, Jul 19, 2010 at 10:51 PM, Glen Barnett <glnbrntt at gmail.com>
>> wrote:
>>>
>>> Assuming I have a matrix of data (or under some restrictions that will
>>> become obvious, possibly a data frame), I want to be able to apply a
>>> list of functions (initially producing a single number from a vector)
>>> to the data and produce a data frame (for compact output) with column
>>> 1 being the function results for the first function, column 2 being
>>> the results for the second function and so on, with each row
>>> corresponding to a column of the original data.
>>>
>>> The obvious application of this is to produce summaries of data sets
>>> (a bit like summary() does on numeric matrices), but with user
>>> supplied functions. I am content for the moment to leave it to the
>>> user to supply functions that work with the data they supply so as to
>>> produce results that will actually be data-frame-able, though I'd like
>>> to ultimately make it a bit nicer than it currently is without
>>> compromising the niceness of the output in the "good" cases.
>>>
>>> The example below is a simplistic approach to this problem (it should
>>> run as is). I have named it "fapply" for fairly obvious reasons, but
>>> added the ".1" because it doesn't accept multidimensional arrays. I
>>> have included the output I generated, which is what I want. There are
>>> some obvious generalizations (e.g. being able to include functions
>>> like range(), say, that produce several values on a vector, rather
>>> than one, making the user's life simpler when a function already does
>>> most of what they need).
>>>
>>> My concern: this looks like a silly approach, growing a list
>>> inside a for loop. Also I recall reading that if you find yourself
>>> using "do.call" you should probably be doing something else.
>>>
>>> So my question: Is there a better way to implement a function like this?
>>>
>>> Or, even better, is there already a function that does this?
>>>
>>> ## example function and code to apply a list of functions to a matrix
>>> ## (here a numeric data frame)
>>>
>>> library(datasets)
>>>
>>> fapply.1 <- function(x, fun.l, colnames=fun.l){
>>>   out.l <- list() # starts with an empty list
>>>   # loop through list of functions
>>>   for (i in seq_along(fun.l)) out.l[[i]] <- apply(x,2,fun.l[[i]])
>>>
>>>   # set up names and make into a data frame
>>>   names(out.l) <- colnames
>>>   attr(out.l,"row.names") <- names(out.l[[1]])
>>>   attr(out.l,"class") <- "data.frame"
>>>   out.l
>>> }
>>>
>>> skewness <- function(x) mean(scale(x)^3) # define a simple numeric function
>>> mean.gt.med <- function(x) mean(x)>median(x) # define a simple non-numeric fn
>>> # make list of fns to apply
>>> flist <- c("mean","sd","skewness","median","mean.gt.med")
>>>
>>> fapply.1(attitude,flist)
>>>                mean        sd    skewness median mean.gt.med
>>> rating     64.63333 12.172562 -0.35792491   65.5       FALSE
>>> complaints 66.60000 13.314757 -0.21541749   65.0        TRUE
>>> privileges 53.13333 12.235430  0.37912287   51.5        TRUE
>>> learning   56.36667 11.737013 -0.05403354   56.5       FALSE
>>> raises     64.63333 10.397226  0.19754317   63.5        TRUE
>>> critical   74.76667  9.894908 -0.86577893   77.5       FALSE
>>> advance    42.93333 10.288706  0.85039799   41.0        TRUE
>>>
>>> ## end code and output
>>>
>>> So did I miss something obvious?
>>>
>>> Any suggestions as far as style or simple stability-enhancing
>>> improvements would be handy.
>>>
>>> regards,
>>> Glen
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>
>>
>
David Winsemius, MD
West Hartford, CT