Christos Hatzis
christos.hatzis at nuverabio.com
Fri Aug 1 21:53:01 CEST 2008
Eleni,
One way to do this is to group the data by id using 'split' and then use
sapply to run the dist function on each element of the resulting list.
The slow step is the split, which took a couple of minutes on my laptop;
the sapply call should not take more than a minute or so.
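# Simulated data: 10,000 ids, each appearing on exactly two rows
size <- 10000
df <- data.frame(
  id=rep(sample(1:100000,size=size),2),
  a=sample(c(NA,rnorm(100,0,1)), size=2*size, replace=TRUE),
  b=sample(c(NA,rnorm(100,0,1)), size=2*size, replace=TRUE),
  c=sample(c(NA,rnorm(100,0,1)), size=2*size, replace=TRUE))
df$id <- factor(df$id)
# Group the rows by id (the slow step)
dfp <- split(df, df$id)
# Euclidean distance between the rows of each group, leaving out the id column
sapply(dfp, function(g) dist(g[, -1]))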
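
The original question also asked for the correlation and for the number of
non-missing pairs behind each value. Below is a minimal sketch of one way to
get both from the same kind of split list. It is not taken from the code
above: the helper name pair_stats, its cols argument and the toy data are
made up for illustration, and it assumes the first two rows of each group
are the pair of interest.

```r
# Hypothetical helper: takes the rows for one id and returns the correlation
# between the first two rows, plus the number of columns where both rows
# are non-missing.
pair_stats <- function(g, cols = c("a", "b", "c")) {
  if (nrow(g) < 2) return(c(r = NA_real_, n = 0))
  u <- unlist(g[1, cols], use.names = FALSE)
  v <- unlist(g[2, cols], use.names = FALSE)
  ok <- !is.na(u) & !is.na(v)   # columns observed in both rows
  n <- sum(ok)
  # cor() needs at least two complete pairs; it also returns NA (with a
  # warning, silenced here) when either row is constant over those pairs.
  r <- if (n >= 2) suppressWarnings(cor(u[ok], v[ok])) else NA_real_
  c(r = r, n = n)
}

# Small made-up example: two ids, two rows each, three measurement columns.
toy <- data.frame(
  id = factor(c(1, 1, 2, 2)),
  a  = c(1, 2, 0.5, NA),
  b  = c(2, 4, 1.0, 3),
  c  = c(3, 7, NA, 2))

# One row per id, with columns r (correlation) and n (non-missing pairs).
res <- t(sapply(split(toy, toy$id), pair_stats))
res
```

With only three columns per row, many pairs will have n below 2 and
therefore r = NA; on the real 29-column data, cols would be set to the
measurement columns actually used.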
-Christos
> -----Original Message-----
> From: r-help-bounces at r-project.org
> [mailto:r-help-bounces at r-project.org] On Behalf Of Eleni Rapsomaniki
> Sent: Friday, August 01, 2008 2:45 PM
> To: r-help at r-project.org
> Subject: [R] correlation between rows of data.frame
>
> Dear R users,
>
> I need to come up with an efficient method to compute the
> correlation (or at least, the euclidean distance if that's
> easier) between specific rows in a data frame (46,232 rows,
> 29 columns). The pairs of rows between which I want to find
> the correlation share a common value in one of the columns.
> So for example, in the following
> x=data.frame(id=rep(sample(1:100000,size=10000),2),
>   a=sample(c(NA,rnorm(10,0,1)),size=10000, replace=T),
>   b=sample(c(NA,rnorm(10,0,1)),size=10000, replace=T),
>   c=sample(c(NA,rnorm(10,0,1)),size=10000, replace=T))
> x$id=factor(x$id)
>
> I would want to compute the correlation between the two rows
> (for cols a,b,c) that share the same id. Using a for loop and
> dist() works but takes a long time (>1 hour, my RAM is 1Gb):
> p=NULL
> for(i in levels(x$id)){p[[i]]=dist(x[x$id==i, -1])}
>
> Is there a more efficient way? I thought about apply/sapply
> etc but I don't think they'll work for rows and can't think
> of an intelligent way to make them work!
> The second problem is that I also need to know how many
> degrees of freedom (ie non missing pairs of values) were used
> in each correlation. Is there a way to also do this efficiently?
>
> I hope this makes sense! Thank you all very much in advance!
>
> Eleni
>
