[R] Quickly calculating the mean results over a collection of data sets?
Dan Davison
davison at stats.ox.ac.uk
Tue Aug 12 13:04:43 CEST 2008
On Tue, Aug 12, 2008 at 04:47:14AM -0400, Michael R. Head wrote:
> I have a collection of datasets in separate data frames which have 3
> independent test parameters (w, x, y) and one dependent variable (z) ,
> together with some additional static test data on each row. What I want
> is a data frame which contains the test data, the parameters (w, x, y)
> and the mean value of all (z)s in the Z column.
>
> Each dataset has around 6000 rows and around 7 columns, which doesn't
> seem outrageously large, so it seems like this shouldn't be too time
> consuming, but the way I've been approaching it seems to take way too
> long (20 seconds for datasets over 4 runs, longer for my datasets over
> 10 runs).
>
> My imperative-coding brain led me to use for loops, which seem to be
> particularly problematic for R performance. My first attempt at this
> looked like the following, which takes roughly 60 seconds to complete. I
> rewrote it a little, but the code was much longer and effectively
> replaces one of the for loops with an lapply(). I could paste the other
> code, but it's much longer and less clear about its intent.
>
Hi Michael,
> #######################
> # Start code snippet
> #######################
> ### inputFiles is just a list of paths to the test runs
> testRuns <- lapply(inputFiles,
>                    function(x) {
>                        read.table(x, header=TRUE)})
(Just BTW lapply(inputFiles, read.table, header=TRUE) is slightly nicer to look at)
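To illustrate that the two forms should be equivalent, here is a minimal sketch. The temporary files and their contents are made up for the example and stand in for the real test runs; lapply() hands any extra named arguments on to the function it calls.

```r
# Write two small made-up tables to temporary files (stand-ins for real runs)
inputFiles <- c(tempfile(fileext = ".txt"), tempfile(fileext = ".txt"))
for (f in inputFiles) {
    write.table(data.frame(W = 1:2, Z = c(0.5, 1.5)), f, row.names = FALSE)
}

# Explicit wrapper function, as in the original post
runsA <- lapply(inputFiles, function(x) read.table(x, header = TRUE))

# Same call, with header=TRUE passed through lapply's ... argument
runsB <- lapply(inputFiles, read.table, header = TRUE)

# Compare the two lists of data frames
sameResult <- identical(runsA, runsB)
```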
>
> ### W, X, Y have (small) natural values
> w <- unique(testRuns[[1]]$W)
> x <- unique(testRuns[[1]]$X)
> y <- unique(testRuns[[1]]$Y)
>
> ### All runs have the same values for all columns
> ### with the exception of the Z values, so just
> ### copy the first test run data
> testMeans <- data.frame(testRuns[[1]])
How about rbind()ing all the data frames together, and working with
the combined data frame? Say that testRuns is
> testRuns
[[1]]
  W X Y          Z
1 1 5 5 -0.5251156
2 5 1 3  1.1761139
3 2 4 4 -0.8934380
4 5 1 1  1.4076303
5 5 3 1  0.4679745

[[2]]
  W X Y          Z
1 1 5 5 -0.8556862
2 5 1 3  0.3517671
3 2 4 4 -1.0202064
4 5 1 1  1.2152349
5 5 3 1  0.4340249

> allRuns <- do.call("rbind", testRuns)
> aggregate(allRuns$Z, by=allRuns[c("W","X","Y")], mean)
  W X Y          x
1 5 1 1  1.3114326
2 5 3 1  0.4509997
3 5 1 3  0.7639405
4 2 4 4 -0.9568222
5 1 5 5 -0.6904009
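For something you can paste and run, here is a rough sketch of the same idea. The numbers and the static Host column are invented for illustration; it assumes the static columns are identical across runs, as described in the question. It uses the formula interface of aggregate() so the result column keeps the name Z, merges the static columns back in from the first run, and then splits the result into one table per value of W.

```r
# Two made-up runs: same (W, X, Y) rows and static data, different Z values
run1 <- data.frame(W = c(1, 5, 2, 5, 5),
                   X = c(5, 1, 4, 1, 3),
                   Y = c(5, 3, 4, 1, 1),
                   Host = "nodeA",              # hypothetical static column
                   Z = c(-0.5, 1.2, -0.9, 1.4, 0.5))
run2 <- transform(run1, Z = c(-0.9, 0.4, -1.1, 1.2, 0.3))
testRuns <- list(run1, run2)

# Stack all runs into one data frame
allRuns <- do.call("rbind", testRuns)

# Mean of Z for each (W, X, Y) combination; the result column is named Z
meanZ <- aggregate(Z ~ W + X + Y, data = allRuns, FUN = mean)

# Re-attach the static columns, taken from the first run (Z dropped)
static <- testRuns[[1]][setdiff(names(testRuns[[1]]), "Z")]
testMeans <- merge(static, meanZ, by = c("W", "X", "Y"))

# One table of mean Z values per value of W, e.g. for plotting Z against X
byW <- split(testMeans, testMeans$W)
```

The merge() step is only needed if you want the extra static columns in the result; if you just need W, X, Y and the mean Z, meanZ is already enough.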
Dan
> for(w0 in w) {
>   for(y0 in y) {
>     for (x0 in x) {
>       row <- which(testMeans$W == w0 &
>                    testMeans$Y == y0 &
>                    testMeans$X == x0)
>       meanValues <- sapply(testRuns,
>                            function(r)
>                            {mean( subset(r,
>                                          r$W == w0 &
>                                          r$Y == y0 &
>                                          r$X == x0)$Z )})
>       testMeans[row,]$Z = mean(meanValues)
>     }
>   }
> }
> ### I will then want to plot certain values over (X, Z),
> ### so ultimately, I'm going to subset the data further.
> ### Code which gives me a list of W tables with mean Z values
> ### works, too.
> #######################
> # End code snippet
> #######################
>
>
> Thanks,
> mike
>
> --
> Michael R. Head <burner at suppressingfire.org>
> http://www.cs.binghamton.edu/~mike/
--
www.stats.ox.ac.uk/~davison