[R] Quicker way of combining vectors into a data.frame

Marc Schwartz marc_schwartz at comcast.net
Thu Nov 30 18:25:51 CET 2006


On Thu, 2006-11-30 at 17:00 +0000, Gavin Simpson wrote:
> Hi,
> 
> In a function, I compute 10 (un-named) vectors of reasonable length
> (4471 in the particular example I have to hand) that I want to combine
> into a data frame object, that the function will return.
> 
> This is very slow, so *I'm* doing something wrong if I want it to be
> quick and efficient, though I'm not sure what the best way to do this
> would be.
> 
> I know it is the combining into data frame bit that is slow, because
> I've Rprof'ed it:
> 
> $by.self
>                         self.time self.pct total.time total.pct
> "names<-.default"           16.58     52.8      16.58      52.8
> "unlist"                     7.22     23.0       7.26      23.1
> "data.frame"                 1.72      5.5      29.38      93.6
> "duplicated.default"         1.66      5.3       1.66       5.3
> "+"                          1.20      3.8       1.20       3.8
> "list"                       0.40      1.3       0.40       1.3
> "as.data.frame.numeric"      0.28      0.9       3.32      10.6
> "apply"                      0.26      0.8       1.70       5.4
> "pmatch"                     0.22      0.7       0.22       0.7
> "paste"                      0.20      0.6       0.90       2.9
> "deparse"                    0.14      0.4       0.70       2.2
> "eval"                       0.12      0.4      31.28      99.7
> "names<-"                    0.12      0.4      16.70      53.2
> "FUN"                        0.12      0.4       1.32       4.2
> "names"                      0.12      0.4       0.14       0.4
> "as.list.default"            0.12      0.4       0.12       0.4
> "duplicated"                 0.10      0.3       1.76       5.6
> "gc"                         0.10      0.3       0.10       0.3
> 
> I also stepped through it under debug(): all the preceding calculations
> are quick, and then this bit takes a little over 20 seconds to complete:
> 
>  fab <- data.frame(lc.ratio = lc.ratio, Q = Q,
>                      fNupt = fNupt,
>                      rho.n = rho.n, rho.s = rho.s,
>                      net.Nimm = net.Nimm,
>                      net.Nden = net.Nden,
>                      CLminN = CLminN,
>                      CLmaxN = CLmaxN,
>                      CLmaxS = CLmaxS)
> 
> I can get it down to c. 5 seconds if I do (not Rprof'ed):
> 
>  fab <- data.frame(lc.ratio, Q,
>                      fNupt,
>                      rho.n, rho.s,
>                      net.Nimm,
>                      net.Nden,
>                      CLminN,
>                      CLmaxN,
>                      CLmaxS)
> 
> But this still seems quite a long time, so I'm thinking that there must
> be a quicker way of doing what I want (end up with a data.frame with the
> 10 vectors in it).
> 
> Can anyone enlighten me?

I am inferring from the above that the 10 columns are all numeric, given
the use of as.data.frame.numeric() and related activities, and that much
of the time is being spent in the column naming process (the lack of
which speeds up your second example).

If that is correct, it is not clear why you want a data frame as opposed
to a numeric matrix, but in either case:

If we have 10 vectors, named Colx, where x is 1:10 and each vector is:

> str(Col1)
 num [1:4471]  0.1423  0.1873 -1.8129  0.0255 -1.7650 ...
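
To reproduce the timings that follow, stand-in vectors are needed. A minimal
sketch, assuming random normal data of the same length as in the question
(the seed and the use of rnorm() are my assumptions, not the original data):

```r
# Hypothetical stand-in data: ten numeric vectors, each of length 4471,
# named Col1 ... Col10.
set.seed(1)                       # arbitrary seed, only for reproducibility
n <- 4471
for (i in 1:10) {
  assign(paste("Col", i, sep = ""), rnorm(n))
}
str(Col1)
```

The values will differ from the str() output shown here, since the original
vectors are not available.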

Then:

> system.time(Mat <- cbind(Col1, Col2, Col3, Col4, Col5, Col6, Col7,
                           Col8, Col9, Col10))
[1] 0.002 0.000 0.001 0.000 0.000


Or:

> system.time(DF <- as.data.frame(cbind(Col1, Col2, Col3, Col4, Col5,
                                        Col6, Col7, Col8, Col9, Col10)))
[1] 0.005 0.000 0.005 0.000 0.000
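
The cbind() route depends on the assumption that every column is numeric,
because a matrix holds a single type. A small sketch with made-up vectors
(num and chr are hypothetical names) showing what happens otherwise:

```r
num <- c(1.5, 2.5, 3.5)
chr <- c("a", "b", "c")

m_num <- cbind(num, num * 2)   # all numeric: result is a numeric matrix
m_mix <- cbind(num, chr)       # mixed types: every value is coerced to character

is.numeric(m_num)              # TRUE
is.character(m_mix)            # TRUE
```

If the columns were of mixed types, converting such a matrix to a data frame
would not restore the numeric columns, so data.frame() itself would probably
be the safer choice in that case.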


You can then set the colnames() after the cbind():

> system.time(colnames(Mat) <- c("lc.ratio", "Q", "fNupt", "rho.n",
                                 "rho.s", "net.Nimm", "net.Nden",
                                 "CLminN", "CLmaxN", "CLmaxS"))
[1] 0.002 0.000 0.001 0.000 0.000
 

> system.time(colnames(DF) <- c("lc.ratio", "Q", "fNupt", "rho.n",
                                "rho.s", "net.Nimm", "net.Nden",
                                "CLminN", "CLmaxN", "CLmaxS"))
[1] 0.011 0.000 0.020 0.000 0.000



> str(Mat)
 num [1:4471, 1:10]  0.1423  0.1873 -1.8129  0.0255 -1.7650 ...
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : chr [1:10] "lc.ratio" "Q" "fNupt" "rho.n" ...

> str(DF)
'data.frame':   4471 obs. of  10 variables:
 $ lc.ratio: num   0.1423  0.1873 -1.8129  0.0255 -1.7650 ...
 $ Q       : num   0.8340 -0.2387 -0.0864 -1.1184 -0.3368 ...
 $ fNupt   : num  -0.1718 -0.0549  1.5194 -1.6127 -1.2019 ...
 $ rho.n   : num  -0.740  0.240  0.522 -1.492  1.003 ...
 $ rho.s   : num  -0.2363 -1.6248 -0.3045  0.0294  0.1240 ...
 $ net.Nimm: num  -0.774  0.947 -1.098  0.809  1.216 ...
 $ net.Nden: num  -0.198 -0.135 -0.300 -0.618 -0.784 ...
 $ CLminN  : num   0.924 -3.265  0.211  0.813  0.262 ...
 $ CLmaxN  : num   0.3212 -0.0502 -0.9978  0.9005 -1.6535 ...
 $ CLmaxS  : num  -0.520  0.278 -0.546 -0.925  1.507 ...
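
Putting the steps above together inside a function might look like the
following. This is only a sketch: combine_numeric() and its col.names
argument are hypothetical names of mine, and it names the matrix before
converting, since naming the matrix was the cheaper of the two colnames()
calls in the timings above.

```r
# Hypothetical helper: bind equal-length numeric vectors into a matrix,
# name the columns, then convert to a data frame once at the end.
combine_numeric <- function(..., col.names) {
  mat <- cbind(...)
  # Guard the all-numeric assumption and the number of names supplied.
  stopifnot(is.numeric(mat), length(col.names) == ncol(mat))
  colnames(mat) <- col.names
  as.data.frame(mat)
}

# Small made-up vectors standing in for two of the ten results.
x <- c(0.1, 0.2, 0.3)
y <- c(10, 20, 30)
fab <- combine_numeric(x, y, col.names = c("lc.ratio", "Q"))
str(fab)
```

If a matrix is all that is needed downstream, the final as.data.frame()
call can simply be dropped.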


HTH,

Marc Schwartz


