[R] merge: right set overwrite left set
Jeff Newmiller
jdnewmil at dcn.davis.ca.us
Sun Jul 12 20:35:48 CEST 2015
I get confused by your use of the position map table. If I follow your
description toward your desired result, I take a different route that
makes sense to me. Perhaps it will make sense to you as well. The key idea
is to make individual comparisons of the values for each combination of
HHid and indv, regardless of where they are in the original data frames.
Below are two different syntactic representations of my understanding of
your problem. They differ only in the approach taken to strip away syntax
clutter. I start by making the HHid identifiers character values in the
original data frames, because their respective factor levels in the two
data frames would not necessarily correspond.
x.HHu <- data.frame(
HHid = c( 'HH1', 'HH2', 'HH3', 'HH4', 'HH5', 'HH10' )
, indv1 = c( 2, 0, 2 , 0, 2, 0 )
, indv2 = c( 0, NA, 2, 2, 2, 2 )
, ind3 = c( 0, 0, 0, 0, 0, 0 )
, stringsAsFactors = FALSE # avoid creating HHid as a factor
)
y.HHo <- data.frame(
HHid=c( 'HH1', 'HH2','HH5', 'HH3', 'HH10' )
, indv1 = c( 2, 0, 2, 0, NA )
, indv2 = c( 0, 2, 2, 1, 2 )
, stringsAsFactors = FALSE
)
z.map <- data.frame(
HHid = c( 'HH1', 'HH2', 'HH3', 'HH4', 'HH5'
, 'HH6','HH8', 'HH7', 'HH9', 'HH10', 'HH11' )
, position= c( 10, 20, 30, 42, 55, 66, 81, 75, 92, 101, 111 )
, stringsAsFactors = FALSE
)
# reshape2 solution
library(reshape2)
x.HHu.long <- melt( x.HHu, "HHid", variable.name = "indv" )
x.HHu.long$indv <- as.character( x.HHu.long$indv )
y.HHo.long <- melt( y.HHo, "HHid", variable.name = "indv" )
y.HHo.long$indv <- as.character( y.HHo.long$indv )
xy.HH.long <- merge( x.HHu.long
, y.HHo.long
, by = c( "HHid", "indv" )
, all = TRUE )
xy.HH.long$value <- with( xy.HH.long
, ifelse( is.na( value.y )
, value.x
, value.y ) )
xy.HH0 <- dcast( xy.HH.long, HHid ~ indv )
xy.HH <- merge( xy.HH0, z.map, all=TRUE )
xy.HH <- xy.HH[ order( xy.HH$HHid ), ]
# compare xy.HH with zzz ... I think there is an error in zzz for
# HH10/indv1, because NA should not be considered more informative
# than 0...
# tidyr/dplyr solution
library(tidyr)
library(dplyr)
# define a common processing sequence
lengthen <- ( . # period is placeholder for data frame
%>% gather( indv, value, -HHid )
%>% mutate( indv = as.character( indv ) )
)
x.HHu.dlong <- x.HHu %>% lengthen
y.HHo.dlong <- y.HHo %>% lengthen
xy.HH.d <- ( x.HHu.dlong
%>% full_join( y.HHo.dlong, by= c( "HHid", "indv" ) )
%>% transmute( HHid = HHid
, indv = indv
, value = ifelse( is.na( value.y )
, value.x
, value.y )
)
%>% spread( indv, value )
%>% full_join( z.map, by="HHid" )
%>% arrange( HHid )
%>% as.data.frame
)
On Sun, 12 Jul 2015, aldi wrote:
> Hi,
> I have two sets of data x.HHu and y.HHo, rows are IDs and columns are
> individuals. I do not know in advance indv or HHid, both of them will be
> captured from the data. As the y.HHo set updates, y.HHo set has better
> information then x.HHu set. Thus I want a merge where right set
> overwrites left set info based on HHid, i.e. to overwrite x.HHu set with
> y.HHo set but keep any extra info from the x.HHu set that is not present
> in y.HHo set.
> HHids will be complete based on z.map, with the corresponding positions.
> I am having trouble with the part after this line: ###
> ============================================+++++++++++++++++++++++++++
> I am thinking that I am creating new columns "position" "indv1" and
> "indv2", but R is interpreting them as row information.
> See the expected final table at the end. HHid is common, indv3 is from
> x.HHu, and the rest position and indv1 and indv2 are from y.HHo
> Any suggestions are appreciated.
> Thank you in advance,
> Aldi
>
> x.HHu<- data.frame(
> HHid = c( 'HH1', 'HH2', 'HH3', 'HH4', 'HH5', 'HH10')
> , indv1 = c( 2, 0, 2 , 0, 2, 0)
> , indv2 = c( 0, NA, 2, 2, 2, 2)
> , ind3 = c( 0, 0, 0, 0, 0, 0)
> )
> ### the HHo data will be the top set to overwrite any HHu data, when
> they exist, thinking that HHo are better than HHu results
> ### when they are available
>
> y.HHo<-data.frame(HHid=c('HH1', 'HH2','HH5', 'HH3', 'HH10')
> , indv1 = c(2, 0, 2, 0, NA)
> , indv2 = c(0, 2, 2, 1, 2)
> )
>
> z.map<-data.frame(HHid = c('HH1', 'HH2', 'HH3', 'HH4', 'HH5',
> 'HH6','HH8', 'HH7', 'HH9', 'HH10', 'HH11')
> , position= c(10,20,30,42,55,66,81,75,92,101,111)
> )
> ### see objects
> x.HHu
> y.HHo
> z.map
> ### now sort the map by position, this sorted map will be used to sort
> finally all data
> z.map<-z.map[with(z.map, order(position)), ]
> z.map
>
> ### First I introduce position to both sets so I can sort them in
> advance by position.
> x.HHu.map <-merge( z.map, x.HHu, by='HHid', all=T)
> x.HHu.map<-x.HHu.map[with(x.HHu.map, order(position)), ]
> x.HHu.map
>
> y.HHo.map <-merge( z.map, y.HHo, by='HHid', all= T)
> y.HHo.map<-y.HHo.map[with(y.HHo.map, order(position)), ]
> y.HHo.map
>
> ### now merge HHu and HHo with the hope to overwrite the HHu set with
> HHo wherever they overlap by column names.
> zzz <- merge(x.HHu.map, y.HHo.map, by='HHid', all=T)
> zzz
> ### find common variable names in two sets
>
> commonNames <- names(x.impu.map)[which(colnames(x.impu.map) %in%
> colnames(y.geno.map))]
>
> ## remove HHid wich is common for x and y, but work with the rest of columns
> commonNames<-commonNames[-c(1)]
>
> ### ============================================+++++++++++++++++++++++++++
> for(i in 1:length(commonNames)){
>
> print(commonNames[i])
> zzz$commonNames[i] <- NA
>
> print(paste("zzz","$",commonNames[i],".y",sep=""))
>
> zzz$commonNames[i] <- zzz[,paste(commonNames[i],".y",sep="")]
>
> ### paste(zzz$commonNames[i],".x",sep='') <- NULL;
> ### paste(zzz$commonNames[i],".y",sep='') <- NULL;
>
> }
> zzz
>
> The final expected set has to be: HHid is common, indv3 is from x.HHu,
> and the rest position and indv1 and indv2 are from y.HHo
> HHid position ind3 indv1 indv2
> 1 HH1 10 0 2 0
> 2 HH10 101 0 NA 2
> 3 HH11 111 NA NA NA
> 4 HH2 20 0 0 2
> 5 HH3 30 0 0 1
> 6 HH4 42 0 NA NA
> 7 HH5 55 0 2 2
> 8 HH6 66 NA NA NA
> 9 HH7 75 NA NA NA
> 10 HH8 81 NA NA NA
> 11 HH9 92 NA NA NA
>
> --
>
>
> [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
---------------------------------------------------------------------------
Jeff Newmiller The ..... ..... Go Live...
DCN:<jdnewmil at dcn.davis.ca.us> Basics: ##.#. ##.#. Live Go...
Live: OO#.. Dead: OO#.. Playing
Research Engineer (Solar/Batteries O.O#. #.O#. with
/Software/Embedded Controllers) .OO#. .OO#. rocks...1k
More information about the R-help
mailing list