[R] How do I combine lists of data.frames into a single data frame?
Marc Schwartz
marc_schwartz at me.com
Fri Jul 16 00:27:39 CEST 2010
Ted,
Based upon your code below, you might be better off using two lapply() constructs to create the x and y results separately, taking advantage of lapply()'s built-in ability to create lists 'on the fly', while returning a NULL when the function will not be applied to the data based upon your test.
For example:
lapply(seq(n), function(i) if (test on ID[i]) funcX() else NULL)
and something like:
lapply(seq(n), function(i) if (test on ID[i]) do.call(rbind, funcY()) else NULL)
and then you can use the do.call() approach on the results of both.
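A minimal runnable sketch of the two calls above. The names `ID`, `enough_data()`, `funcX()` and `funcY()` are hypothetical stand-ins, not functions from your code, and the test and data are made up purely for illustration:

```r
# Hypothetical stand-ins, for illustration only
ID <- c(3, 12, 7, 20)                    # pretend sample size for each ID
enough_data <- function(n) n >= 10       # assumed test on an ID

# Assumed to return a single data frame
funcX <- function(n) data.frame(n = n, half = n / 2)

# Assumed to return a list of data frames
funcY <- function(n) list(data.frame(n = n, part = "a"),
                          data.frame(n = n, part = "b"))

n <- length(ID)

# One list element per ID: a data frame, or NULL when the test fails
x_list <- lapply(seq(n), function(i) {
  if (enough_data(ID[i])) funcX(ID[i]) else NULL
})

# Collapse each list of data frames inside the loop, so that each
# element is again a single data frame or NULL
y_list <- lapply(seq(n), function(i) {
  if (enough_data(ID[i])) do.call(rbind, funcY(ID[i])) else NULL
})

# Combine the results of both
x_all <- do.call(rbind, x_list)
y_all <- do.call(rbind, y_list)
```

Here two of the four IDs pass the assumed test, so x_all ends up with one row per passing ID and y_all with two rows per passing ID.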
Consider:
# Only return data if 'i' is even
Res1 <- lapply(1:5, function(i) if (i %% 2 == 0) iris[1:i, ] else NULL)
> Res1
[[1]]
NULL
[[2]]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
[[3]]
NULL
[[4]]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
[[5]]
NULL
When we use do.call() here the elements that are NULL do not result in any problems creating the result:
> do.call(rbind, Res1)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 5.1 3.5 1.4 0.2 setosa
4 4.9 3.0 1.4 0.2 setosa
5 4.7 3.2 1.3 0.2 setosa
6 4.6 3.1 1.5 0.2 setosa
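If you want to convince yourself that the NULL elements are simply skipped, one way (a sketch, not something you need in practice) is to drop them explicitly with Filter() and compare the two results:

```r
# Same construction as above: data only when 'i' is even
Res1 <- lapply(1:5, function(i) if (i %% 2 == 0) iris[1:i, ] else NULL)

# Bind with the NULL elements left in place
with_nulls <- do.call(rbind, Res1)

# Bind after removing the NULL elements explicitly
without_nulls <- do.call(rbind, Filter(Negate(is.null), Res1))

# Both contain the 2 + 4 rows that came from i = 2 and i = 4
nrow(with_nulls)
nrow(without_nulls)
all.equal(with_nulls, without_nulls)
```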
Now consider the second example, where your function would return a list of data frames. I'll use replicate() with 'simplify = FALSE' so that the result within lapply() is either a single list of data frames or NULL. If the result would be a list of data frames, we'll use do.call() within the loop so that lapply() returns a single data frame rather than a list of data frames. Consider:
> replicate(3, iris[1:3, ], simplify = FALSE)
[[1]]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
[[2]]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
[[3]]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
> do.call(rbind, replicate(3, iris[1:3, ], simplify = FALSE))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 5.1 3.5 1.4 0.2 setosa
5 4.9 3.0 1.4 0.2 setosa
6 4.7 3.2 1.3 0.2 setosa
7 5.1 3.5 1.4 0.2 setosa
8 4.9 3.0 1.4 0.2 setosa
9 4.7 3.2 1.3 0.2 setosa
So now:
Res2 <- lapply(1:5, function(i) if (i %% 2 == 0)
                                  do.call(rbind, replicate(i, iris[1:i, ],
                                                           simplify = FALSE))
                                else NULL)
> Res2
[[1]]
NULL
[[2]]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 5.1 3.5 1.4 0.2 setosa
4 4.9 3.0 1.4 0.2 setosa
[[3]]
NULL
[[4]]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.1 3.5 1.4 0.2 setosa
6 4.9 3.0 1.4 0.2 setosa
7 4.7 3.2 1.3 0.2 setosa
8 4.6 3.1 1.5 0.2 setosa
9 5.1 3.5 1.4 0.2 setosa
10 4.9 3.0 1.4 0.2 setosa
11 4.7 3.2 1.3 0.2 setosa
12 4.6 3.1 1.5 0.2 setosa
13 5.1 3.5 1.4 0.2 setosa
14 4.9 3.0 1.4 0.2 setosa
15 4.7 3.2 1.3 0.2 setosa
16 4.6 3.1 1.5 0.2 setosa
[[5]]
NULL
> do.call(rbind, Res2)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 5.1 3.5 1.4 0.2 setosa
4 4.9 3.0 1.4 0.2 setosa
5 5.1 3.5 1.4 0.2 setosa
6 4.9 3.0 1.4 0.2 setosa
7 4.7 3.2 1.3 0.2 setosa
8 4.6 3.1 1.5 0.2 setosa
9 5.1 3.5 1.4 0.2 setosa
10 4.9 3.0 1.4 0.2 setosa
11 4.7 3.2 1.3 0.2 setosa
12 4.6 3.1 1.5 0.2 setosa
13 5.1 3.5 1.4 0.2 setosa
14 4.9 3.0 1.4 0.2 setosa
15 4.7 3.2 1.3 0.2 setosa
16 4.6 3.1 1.5 0.2 setosa
17 5.1 3.5 1.4 0.2 setosa
18 4.9 3.0 1.4 0.2 setosa
19 4.7 3.2 1.3 0.2 setosa
20 4.6 3.1 1.5 0.2 setosa
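As an alternative sketch, closer to the do.call(rbind, c(z1, z2, z3, z4)) form from earlier in the thread, you could leave each element as a list of data frames, flatten the nested list one level with c(), and bind once at the end. The object names `nested`, `flat` and `Res2_flat` are just illustrative:

```r
# Each element is either a list of data frames or NULL
nested <- lapply(1:5, function(i) {
  if (i %% 2 == 0) replicate(i, iris[1:i, ], simplify = FALSE) else NULL
})

# c() concatenates the inner lists and contributes nothing for NULL,
# giving one flat list of 2 + 4 = 6 data frames
flat <- do.call(c, nested)
length(flat)

# A single rbind over the flat list
Res2_flat <- do.call(rbind, flat)
nrow(Res2_flat)    # 2 * 2 + 4 * 4 = 20 rows
```

Which of the two forms is preferable is largely a matter of taste; binding inside the loop keeps each lapply() element a plain data frame, which is easier to inspect.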
So if you separate the two procedures, given that they return differing data and structures, then by using lapply() you can avoid worrying about the returned data structure, and you do not have to preallocate anything without knowing how many IDs there will be. By returning NULL when the respective function will not be applied based upon your test, you can still use do.call(rbind, TheList), since those list elements that are NULL will be ignored in the result.
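For comparison, here is a sketch of how the for() loop from your message could grow the two lists directly, which is roughly the 'push_back' idiom you asked about. It is not the approach I am suggesting above, and it uses the iris data and an even/odd test as stand-ins for your functions and your test:

```r
# Empty lists can be created before the loop
x_list <- list()
y_list <- list()

for (i in 1:5) {
  if (i %% 2 == 0) {                       # stand-in for the real test
    x <- iris[1:i, ]                                   # one data frame
    y <- replicate(i, iris[1:i, ], simplify = FALSE)   # list of data frames

    # Append a single element to the end of the list
    x_list[[length(x_list) + 1]] <- x

    # Append every element of y to the end of the list
    y_list <- c(y_list, y)
  }
}

x_all <- do.call(rbind, x_list)
y_all <- do.call(rbind, y_list)
```

Note that c() builds a new list on every call, so for a large number of IDs the lapply() form is usually the simpler choice.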
Does that help?
Marc
On Jul 15, 2010, at 4:32 PM, Ted Byers wrote:
> Thanks Marc
>
> Part of the challenge here is that EVERYTHING is dynamic. New data is being added to the DB all the time. Each active ID makes a new sample every day, or at a minimum every week, and new IDs are added every week. So I can't hard code anything. If, for a given ID, I had 50 weekly samples last week, I'll have 51 samples this week.
>
> But some of the IDs have sample sizes that are so small, it would be pure BS to try to use fitdist on their data.
>
> I have figured out a way to handle this for a given ID, and so I have the loop that iterates over the IDs, and processes the data for that ID IF there is sufficient data. And to make things interesting, the number of IDs I need to process this week is greater than the number of IDs I had to process last week.
>
> So, I iterate over IDs, from 1 up through perhaps 500. If a given ID has sufficient data, I get the z lists. And I have checked, applying rbind to these works great! Of all the IDs' datasets I have examined, perhaps 10% do not yet have enough data to work with (but that, too, changes through time).
>
> From what you have said, it would seem that I ought to make a master list. So, I need to learn how to make a master list grow from nothing to include all these z lists. That reduces to a question of how can one append dynamically created lists of varying size (from just a few list elements to a few hundred list elements) to such a master list.
>
> Actually, when it gets right down to it, I think I am ignorant of a key piece of the puzzle (I have probably missed the key part of the documentation dealing with this). I do not yet know how to add even one element to a list within a loop where the list does not exist (or at least is empty) at the beginning of the loop.
>
> I get your example "do.call(rbind, c(z1, z2, z3, z4))", but what do you do if there is no list at the beginning of a loop and you need to handle something like:
>
> #n is some large number, and in about 10% of values of 'i' (not known a priori) creation
> # of x and y is skipped
> for (i in 1:n) {
>   if (test that returns TRUE only 90% of the time) {
>     x = function_that_makes_a_data_frame()
>     y = function_that_makes_a_list_of_data_frames()
>   }
> }
>
> We have not created any lists on entry into the loop. How do we create a list containing all instances of x and another that contains all elements that had been in each instance of y? If I can learn how to do that, then I can call do.call(rbind,x_list) and do.call(rbind,y_element_list).
>
> If you know C++, and specifically the STL containers and algorithms, you will know that one can grow vectors or lists using a function called 'push_back', which is defined on most STL containers. I am looking for the R equivalent for objects, and the R equivalent of the C++ STL algorithm std::copy (passed the begin and end iterators of the source list and a back inserter for the recipient container), for appending a source list to a master list.
>
> Thanks
>
> Ted