[R] Problems with merge
Don MacQueen
macq at llnl.gov
Thu Oct 7 16:46:29 CEST 2004
At 10:44 AM +0530 10/6/04, Vikas Rawal wrote:
>This issue has been discussed on this list before but the solutions
>offered are not satisfactory. So I thought I shall raise it again.
>
>I want to merge two datasets which have three common variables.
>These variables DO NOT have the same names in both the files. In
>addition, there are two variables with same name which do not
>necessarily have exactly same data. That is, there could be some
>discrepancy between the two datasets when it comes to these
>variables. I do not want them to be used when I merge the datasets.
>
>The problem is that R allows you to use by.x and by.y variables to
>specify only one variable in x dataset and one variable in y dataset
>to merge. Otherwise, if you do not specify anything, it matches all
>the variables that have common names to merge. This is very
>problematic. In my case, the variables I want to use to match do
>not have same names in two datasets and the ones that have same
>names must not be used to match.
>
>One approach will be to change names of variables and then merge.
>But that is not elegant, to say the least.
>
>If nothing else works, that is what I shall have to do. There again
>we have some problem. How do I change the name of a particular
>column. One solution suggested somewhere in the archives of the list
>is to use
>
>names(data.frame)=c(list of column names)
>
>But this requires you to list all the variable names. That can
>obviously be cumbersome when you have large number of variables.
>What would be the syntax if I want to change just one column name.
It's not that hard to figure out the syntax, using functions like
match(), intersect(), setdiff() and friends. Here is a suggestion:
mydf <- rename(mydf,from='oldvarname',to='newvarname')
where the rename function is this:
rename <- function(data, from = "", to = "", info = TRUE)
{
    ## Rename columns of a data frame: each name in 'from' is replaced
    ## by the corresponding name in 'to'. Returns the data frame invisibly.
    dsn <- deparse(substitute(data))
    dfn <- names(data)
    if (length(from) != length(to))
        stop("from and to not same length")
    if (length(dfn) < length(to))
        stop("too many new names")
    ## positions of the old names; NA means the name was not found
    chng <- match(from, dfn)
    if (any(is.na(chng)))
        stop("some of the from names not found in ", dsn)
    if (length(to) != length(unique(to)))
        stop("new names not unique")
    dfn.new <- dfn
    dfn.new[chng] <- to
    if (info) {
        ## report the changes as a two-row From/To table
        cat("\nChanging in", dsn)
        tmp <- rbind(from, to)
        dimnames(tmp) <- list(c("From:", "To:"), rep("", length(from)))
        print(tmp, quote = FALSE)
    }
    names(data) <- dfn.new
    invisible(data)
}
'from' and 'to' can be character vectors, and they must be of the same length.
It wouldn't be hard to modify it to *not* receive and return the
entire dataframe, but I found it more convenient to use this way.
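As a rough illustration of that names-only variant, here is a minimal sketch. The helper rename.names and the data frame mydf are hypothetical names made up for this example, not part of the function above, and the checks are deliberately reduced to the bare minimum.

```r
## Hypothetical helper: takes and returns a character vector of names
## instead of the whole data frame.
rename.names <- function(nms, from, to)
{
    stopifnot(length(from) == length(to), all(from %in% nms))
    nms[match(from, nms)] <- to
    nms
}

## Made-up example data
mydf <- data.frame(a = 1:2, b = 3:4, c = 5:6)
names(mydf) <- rename.names(names(mydf), from = "b", to = "B")

## For a single column, plain indexing on names() also does the job:
names(mydf)[names(mydf) == "c"] <- "C"
```

Either form changes only the named columns and leaves the rest of names(mydf) alone, so there is no need to list every column name.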
Also, I wrote that function a long time ago, when I had a lot less
experience than I do now (just in case anyone notices some obvious
room for improvement!)
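On the merge question itself, it may be worth checking whether renaming is needed at all: as far as I know, the by.x and by.y arguments of merge() accept character vectors, so several differently named key columns can be paired up by position. The sketch below uses made-up data frames x and y with invented column names; only merge() and its by.x and by.y arguments are real.

```r
## Made-up data: the three key columns have different names in x and y,
## and both share a column 'income' that must NOT be used for matching.
x <- data.frame(vill = c(1, 1, 2), hh = c(1, 2, 1),
                yr = c(2003, 2003, 2003), income = c(10, 20, 30))
y <- data.frame(village = c(1, 1, 2), household = c(1, 2, 1),
                year = c(2003, 2003, 2003), income = c(11, 20, 29))

## by.x[i] is matched against by.y[i]; 'income' is not listed, so it is
## not used as a key and both copies are kept in the result.
m <- merge(x, y,
           by.x = c("vill", "hh", "yr"),
           by.y = c("village", "household", "year"))
names(m)
```

The key columns keep their names from x, and the two income columns come back as income.x and income.y, which makes any discrepancy between the datasets easy to inspect.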
>
>Vikas
>
>______________________________________________
>R-help at stat.math.ethz.ch mailing list
>https://stat.ethz.ch/mailman/listinfo/r-help
>PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
--
--------------------------------------
Don MacQueen
Environmental Protection Department
Lawrence Livermore National Laboratory
Livermore, CA, USA