[R] "unsparse" a vector

Wed Feb 8 22:53:52 CET 2012

When compute time is important it often helps
to loop over columns instead of over rows (assuming
there are fewer columns than rows, the usual
case).  E.g., putting your code into a function f0
and the column-looping version into f1:

f0 <- function(v) {
  n <- length(v)
  d <- data.frame(A=vector(length=n,mode="integer"),
                  B=vector(length=n,mode="integer"),
                  C=vector(length=n,mode="integer"))
  l <- strsplit(gsub("(.{2})","\\1,",v),",")
  for (i in seq_along(l)) {
    l1 <- l[[i]]
    for (j in seq_along(l1)) {
      d[[substring(l1[j],1,1)]][i] <- as.integer(substring(l1[j],2,2))
    }
  }
  d
}

f1 <- function(v) {
  n <- length(v)
  letters <- c("A", "B", "C")
  names(letters) <- letters
  data.frame(lapply(letters,
              function(letter) {
                  retval <- integer(n)
                  hasLetter <- grepl(letter, v)
                  retval[hasLetter] <- as.integer(
                     gsub(sprintf("^.*%s([[:digit:]]+).*$", letter), 
                          "\\1", 
                          v[hasLetter]))
                  retval
              }))
}

I get the following times for a 10,000 long v like yours
(and the results are the same):

> vv <- rep(v, len=10000)
> system.time(r1 <- f1(vv))
   user  system elapsed 
   0.13    0.00    0.14 
> system.time(r0 <- f0(vv))
   user  system elapsed 
  10.75    0.19   10.99 
> all.equal(r0, r1)
[1] TRUE

If I double the length of v, your code takes 53 seconds (5x slower,
quadratic behavior?) while mine takes 0.17 (less than linear, suggesting
that its time is still dominated by the function call overhead for such
small input vectors).

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com 

> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Sam Steingold
> Sent: Wednesday, February 08, 2012 12:56 PM
> To: r-help at r-project.org
> Subject: Re: [R] "unsparse" a vector
> 
> To be clear, I can do that with nested for loops:
> 
> v <- c("A1B2","A3C4","B5","C6A7B8")
> l <- strsplit(gsub("(.{2})","\\1,",v),",")
> d <- data.frame(A=vector(length=4,mode="integer"),
>                 B=vector(length=4,mode="integer"),
>                 C=vector(length=4,mode="integer"))
> 
> for (i in 1:length(l)) {
>   l1 <- l[[i]]
>   for (j in 1:length(l1)) {
>     d[[substring(l1[j],1,1)]][i] <- as.numeric(substring(l1[j],2,2))
>   }
> }
> 
> 
> but I am afraid that handling 1,000,000 (=length(unlist(l))) strings in
> a loop will kill me.
> 
> 
> > * Sam Steingold <fqf at tah.bet> [2012-02-08 15:34:38 -0500]:
> >
> > Suppose I have a vector of strings:
> > c("A1B2","A3C4","B5","C6A7B8")
> > [1] "A1B2"   "A3C4"   "B5"     "C6A7B8"
> > where each string is a sequence of <column><value> pairs
> > (fixed width, in this example both value and name are 1 character, in
> > reality the column name is 6 chars and value is 2 digits).
> > I need to convert it to a data frame:
> > data.frame(A=c(1,3,0,7),B=c(2,0,5,8),C=c(0,4,0,6))
> >   A B C
> > 1 1 2 0
> > 2 3 0 4
> > 3 0 5 0
> > 4 7 8 6
> >
> > how do I do that?
> > thanks.
> 
> --
> Sam Steingold (http://sds.podval.org/) on Ubuntu 11.10 (oneiric) X 11.0.11004000
> http://palestinefacts.org http://iris.org.il http://camera.org
> http://ffii.org http://www.PetitionOnline.com/tap12009/
> An elephant is a mouse with an operating system.
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.