[R] combining dataframes with different numbers of columns
Denis Chabot
chabotd at globetrotter.net
Wed Nov 8 03:30:08 CET 2006
Dear list members,
I have to combine dataframes together. However they contain different
numbers of variables. It is possible that all the variables in the
dataframe with fewer variables are contained in the dataframe with
more variables, though it is not always the case.
There are key variables identifying observations. These could be used
in a merge statement, although this won't quite work for me (see below).
I was hoping to find a way to combine dataframes where I needed only
to ensure the key variables were present. The total number of
variables in the final dataframe would be the total number of
different variables in both initial dataframes. Variables that were
absent in one dataframe would automatically get missing values in the
joint dataframe.
Here is a simple example. The initial dataframes are a and b. All
variables in b are also in a.
a <- data.frame(X=seq(1,10), matrix(runif(100, 0,15), ncol=10))
b <- data.frame(X=seq(16,20), X4=runif(5,0,15))
A merge does not work because the common variable X4 becomes 2
variables, X4.x and X4.y.
c <- merge(a,b,by="X", all=T)
This can be fixed but it requires several steps (although my solution
is probably not optimal):
names(c)[5] <- "X4"
c$X4[is.na(c$X4)] <- c$X4.y[is.na(c$X4)]
c <- c[,1:11]
One quickly becomes tired with this solution with my real-life
dataframes where different columns would require "repair" from one
case to the next.
I think I still prefer making the narrower dataframe like the wider one:
b2 <- upData(b, X1=NA, X2=NA, X3=NA, X5=NA, X6=NA, X7=NA, X8=NA,
X9=NA, X10=NA)
b2 <- b2[,c(1, 3:5, 2, 6:11)]
d <- rbind(a, b2)
But again this requires quite a bit of fine-tuning from one case to
the next in my real-life dataframes.
I suspect R has a neat way to do this and I just cannot come up with
the proper search term to find help on my own.
Or this can be automated: can one compare variable lists from 2
dataframes and add missing variables in the "narrower" dataframe?
Ideally, the solution would be able to handle the situation where the
narrower dataframe contains one or more variables that are absent
from the wider one. If this was the case, I'd like the new variable
to be present in the combined dataframe, with missing values given to
the observations from the wider dataframe.
Thanks in advance,
Denis Chabot
More information about the R-help
mailing list