[R] Create sequential vector for values in another column

William Dunlap wdunlap at tibco.com
Fri Oct 11 18:50:54 CEST 2013


At this point 3 functions have been suggested and I'll add a 4th:
  f1 <- function(x)unlist(lapply(unname(split(rep.int(1L,length(x)), x)), cumsum))
  f2 <- function(x)unlist(sapply(rle(x)$lengths, function(k) 1:k ))
  f3 <- function(x)ave(x,x,FUN=seq)
  f4 <- function(x)ave(seq_along(x), x, FUN=seq_along)
You can compare their results with ftest (as long as their results have the
same lengths):
  ftest <- function(x) {
     data.frame(x, f1=f1(x), f2=f2(x), f3=f3(x), f4=f4(x))
  }
They all return the same result for Steven's sample data, which is numeric
and in sorted order:
  x0 <- c(123.45, 123.45, 123.45, 123.45, 234.56,
          234.56, 234.56, 234.56, 234.56, 234.56, 234.56, 345.67, 345.67,
          345.67, 456.78, 456.78, 456.78, 456.78, 456.78, 456.78, 456.78,
          456.78, 456.78)
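
The agreement on the sample data can also be checked programmatically rather
than by eye.  This is only a sketch: the four definitions are copied from above
so that it runs on its own, x0 is rebuilt with rep() as a shorthand, and the
names results and agree are made up for the illustration.

```r
# Definitions copied from the message so this block is self-contained.
f1 <- function(x)unlist(lapply(unname(split(rep.int(1L,length(x)), x)), cumsum))
f2 <- function(x)unlist(sapply(rle(x)$lengths, function(k) 1:k ))
f3 <- function(x)ave(x,x,FUN=seq)
f4 <- function(x)ave(seq_along(x), x, FUN=seq_along)

# Same 23 values as the sample data: 4, 7, 3 and 9 copies of each id.
x0 <- rep(c(123.45, 234.56, 345.67, 456.78), times = c(4, 7, 3, 9))

results <- list(f1 = f1(x0), f2 = f2(x0), f3 = f3(x0), f4 = f4(x0))

# as.integer() removes the integer/double difference (f3 returns doubles
# for numeric input), so only the values are compared.
agree <- vapply(results,
                function(r) identical(as.integer(r), as.integer(results$f4)),
                logical(1))
print(agree)
```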
However, f1() gives the wrong answer if x is not sorted:
  > ftest(c(30,30,30, 20,20))
     x f1 f2 f3 f4
  1 30  1  1  1  1
  2 30  2  2  2  2
  3 30  1  3  3  3
  4 20  2  1  1  1
  5 20  3  2  2  2

f1() and f2() give the wrong answer if the groups are split up in the data:
  > ftest(c(10,10, 8,8,8, 10,10,10)) # 10's not contiguous
     x f1 f2 f3 f4
  1 10  1  1  1  1
  2 10  2  2  2  2
  3  8  3  1  1  1
  4  8  1  2  2  2
  5  8  2  3  3  3
  6 10  3  1  3  3
  7 10  4  2  4  4
  8 10  5  3  5  5
(It is not clear what result the OP wants here.)

f3() gives the wrong answer if x is not numeric (the result is character):
  > f3(c("a","a","a", "b","b"))
  [1] "1" "2" "3" "1" "2"

f3() also gives an ominous warning if there is a singleton in x (because
seq(11) is 1:11, not 1):
  > f3(c(1,1,1, 11))
  [1] 1 2 3 1
  Warning message:
  In `split<-.default`(`*tmp*`, g, value = lapply(split(x, g), FUN)) :
    number of items to replace is not a multiple of replacement length
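
The warning traces back to how seq() treats a single numeric argument of length
one: as an upper bound, not as a vector whose elements are to be numbered.  A
small illustration (the variable names are made up for the example):

```r
# A length-one numeric argument is treated as an upper bound: same as 1:11.
from_scalar <- seq(11)

# seq_along() numbers the elements of its argument, so a singleton gives 1.
along_scalar <- seq_along(11)

# With more than one element, seq() behaves like seq_along().
from_vector <- seq(c(1, 1, 1))

print(length(from_scalar))
print(along_scalar)
print(from_vector)
```

This is why a group of size one trips up FUN=seq inside ave(), while FUN=seq_along
does not have the problem.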

f2() fails to give an answer if x is a factor:
  > f2(factor(c("x","y","z")))
  Error in rle(x) : 'x' must be an atomic vector
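
If numbering within each contiguous run is what is actually wanted, one possible
workaround is to convert a factor to character before calling rle().  The sketch
below is an assumption about how f2 might be patched, not something proposed in
the thread; the name f2b is hypothetical, and it uses base R's sequence() in
place of the sapply() call.

```r
# Hypothetical run-based variant of f2 that also accepts factors.
f2b <- function(x) {
  # rle() rejects factors, so fall back to their character labels.
  if (is.factor(x)) x <- as.character(x)
  # sequence(c(2, 3)) is c(1, 2, 1, 2, 3): one 1..k counter per run.
  sequence(rle(x)$lengths)
}

print(f2b(factor(c("x", "y", "z"))))
print(f2b(c(10, 10, 8, 8, 8, 10, 10, 10)))
```

Like f2, this restarts the count whenever the value changes, so non-contiguous
groups are still numbered per run rather than per unique value.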

I think f4 gives the correct result for all those cases.
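
That claim can be checked against the inputs used above.  A minimal sketch,
with f4 copied from the top of the message and the list name cases made up for
the illustration:

```r
f4 <- function(x)ave(seq_along(x), x, FUN=seq_along)

# The problem inputs from the examples above.
cases <- list(
  unsorted      = c(30, 30, 30, 20, 20),
  noncontiguous = c(10, 10, 8, 8, 8, 10, 10, 10),
  character     = c("a", "a", "a", "b", "b"),
  singleton     = c(1, 1, 1, 11),
  factor        = factor(c("x", "y", "z"))
)

# Each result is an integer vector counting occurrences within each value.
for (nm in names(cases)) {
  cat(nm, ":", f4(cases[[nm]]), "\n")
}
```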

I think f1, f3, and f4 all call lapply(split()) at some point (f2 uses rle()
instead), and that can use a lot of memory when there are lots of unique values
in x.  You can use a sort-based algorithm to avoid that problem.
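
One way such a sort-based approach might look is sketched below.  The name f5
and all of its internals are assumptions made for illustration, not code from
the thread.  It relies on order() leaving ties in their original order, and it
assumes x has no NA values.  Whether it is actually faster or lighter on memory
for a given data set would need to be measured.

```r
# Hypothetical sort-based alternative: no per-group list is built.
f5 <- function(x) {
  n <- length(x)
  if (n == 0L) return(integer(0))
  o <- order(x)                     # ties keep their original relative order
  xs <- x[o]                        # values with equal elements now adjacent
  # TRUE at the first element of each group in the sorted vector.
  start <- c(TRUE, xs[-1L] != xs[-n])
  idx <- seq_len(n)
  # Position of the most recent group start, carried forward by cummax().
  last_start <- cummax(ifelse(start, idx, 0L))
  within <- idx - last_start + 1L   # 1, 2, 3, ... inside each group
  out <- integer(n)
  out[o] <- within                  # put the counters back in input order
  out
}

print(f5(c(10, 10, 8, 8, 8, 10, 10, 10)))
```

On the non-contiguous example this gives the same numbering as f4, counting
occurrences per unique value rather than per run.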

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com


> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf
> Of arun
> Sent: Friday, October 11, 2013 6:43 AM
> To: Steven Ranney; r-help at r-project.org
> Subject: Re: [R] Create sequential vector for values in another column
> 
> 
> 
> Also,
> 
> it might be faster to use ?data.table()
> library(data.table)
>  dt1<- data.table(dat1,key='id.name')
> dt1[,x:=seq(.N),by='id.name']
> A.K.
> 
> 
> On , arun <smartpink111 at yahoo.com> wrote:
> Hi,
> Try:
> dat1<-
> 
> structure(list(id.name = c(123.45, 123.45, 123.45, 123.45, 234.56,
> 234.56, 234.56, 234.56, 234.56, 234.56, 234.56, 345.67, 345.67,
> 345.67, 456.78, 456.78, 456.78, 456.78, 456.78, 456.78, 456.78,
> 456.78, 456.78)), .Names = "id.name", class = "data.frame", row.names = c(NA,
> -23L))
> dat1$x <- with(dat1,ave(id.name,id.name,FUN=seq))
> A.K.
> 
> 
> 
> On Friday, October 11, 2013 9:28 AM, Steven Ranney <steven.ranney at gmail.com>
> wrote:
> Hello all -
> 
> I have an example column in a dataFrame
> 
> id.name
> 123.45
> 123.45
> 123.45
> 123.45
> 234.56
> 234.56
> 234.56
> 234.56
> 234.56
> 234.56
> 234.56
> 345.67
> 345.67
> 345.67
> 456.78
> 456.78
> 456.78
> 456.78
> 456.78
> 456.78
> 456.78
> 456.78
> 456.78
> ...
> [truncated]
> 
> And I'd like to create a second vector of sequential values (i.e., 1:N) for
> each unique id.name value.  In other words, I need
> 
> id.name  x
> 123.45   1
> 123.45   2
> 123.45   3
> 123.45   4
> 234.56   1
> 234.56   2
> 234.56   3
> 234.56   4
> 234.56   5
> 234.56   6
> 234.56   7
> 345.67   1
> 345.67   2
> 345.67   3
> 456.78   1
> 456.78   2
> 456.78   3
> 456.78   4
> 456.78   5
> 456.78   6
> 456.78   7
> 456.78   8
> 456.78   9
> 
> The number of unique id.name values is different; for some values, nrow()
> may be 42 and for others it may be 36, etc.
> 
> The only way I could think of to do this is with two nested for loops.  I
> tried it but because this data set is so large (nrow = 112,679 with 2,161
> unique values of id.name), it took several hours to run.
> 
> Is there an easier way to create this vector?  I'd appreciate your thoughts.
> 
> Thanks -
> 
> SR
> Steven H. Ranney
> 
>     [[alternative HTML version deleted]]
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
> 
> 


