[R] union data in column
Jeff Newmiller
jdnewmil at dcn.davis.ca.us
Sat Jul 24 09:23:07 CEST 2010
Fahim Md wrote:
> Is there any function/way to merge/unite the following data
>
> GENEID col1 col2 col3 col4
> G234064 1 0 0 0
> G234064 1 0 0 0
> G234064 1 0 0 0
> G234064 0 1 0 0
> G234065 0 1 0 0
> G234065 0 1 0 0
> G234065 0 1 0 0
> G234065 0 0 1 0
> G234065 0 0 1 0
> G234065 0 0 0 1
>
>
> into
> GENEID col1 col2 col3 col4
> G234064 1 1 0 0
> // 1 appears in col1 and col2 above, rest are zero
> G234065 0 1 1 1
> // 1 appears in col2 , 3 and 4 above.
>
>
> Thank
Warning on terminology: there is a "merge" function in R that lines up
rows from different tables to make a new set of longer rows (more
columns). The usual term for combining column values from multiple rows
is "aggregation".
In addition to the example offered by Jim Holtzman, here are some other
options in no particular order:
x <- read.table(textConnection(" GENEID col1 col2 col3 col4
G234064 1 0 0 0
G234064 1 0 0 0
G234064 1 0 0 0
G234064 0 1 0 0
G234065 0 1 0 0
G234065 0 1 0 0
G234065 0 1 0 0
G234065 0 0 1 0
G234065 0 0 1 0
G234065 0 0 0 1
"), header=TRUE, as.is=TRUE, row.names=NULL)
closeAllConnections()
# syntactic repackaging of Jim's basic approach
library(plyr)
ddply( x, .(GENEID), function(df)
{with(as.integer(c(col1=any(col1),col2=any(col2),col3=any(col3),col4=any(col4))))}
)
# if you are familiar with SQL, this approach allows you merge data
frames and aggregate rows in the same step, but has a somewhat limited
range of aggregating functions available
library(sqldf)
sqldf("SELECT GENEID, max(col1) AS col1, max(col2) AS col2, max(col3) AS
col3, max(col4) AS col4 FROM x GROUP BY GENEID")
# when the data you want to crunch is separate from the key(s) or you
can separate them yourself
crunch <- function(tmp){as.integer(any(tmp))}
aggregate( x[,c("col1","col2","col3","col4")], by=list(x$GENEID),
FUN="crunch" )
# this is typically used if you want to aggregate many columns with the
same set of operations
library(doBy)
summaryBy( .~GENEID, data=x, FUN=c(sum,crunch) )
More information about the R-help
mailing list