[R] rle with data.table - is it possible?

Jeff Newmiller jdnewmil at dcn.davis.ca.us
Sat Jan 3 08:45:29 CET 2015


Here is what I get when I try to use your algorithm:

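library( data.table )

# Number each run of consecutive TRUE values in s as 1, 2, 3, ...
# Positions where s is FALSE are left at 0.
myf <- function( s ) {
   seg <- rep( 0, length( s ) )
   rs <- rle( s )
   span <- rs$lengths[ rs$values ]   # lengths of the TRUE runs only
   seg[ s ] <- rep( seq_along( span ), times = span )
   seg
}

# x is the data frame built with read.table() in my earlier message, quoted below
DT <- data.table( x )
DT[ , dadseg := myf( Dad %in% c( "AA", "RR" ) ), by=Group ]
DT[ , mumseg := myf( Mum %in% c( "AA", "RR" ) ), by=Group ]
DT[ , childseg := myf( Child %in% c( "AA", "RR" ) ), by=Group ]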
> DT
     Dad Mum Child Group dadseg mumseg childseg
  1:  AA  RR    RA     A      1      1        0
  2:  AA  RR    RR     A      1      1        1
  3:  AA  AA    AA     B      1      1        1
  4:  AA  AA    AA     B      1      1        1
  5:  RA  AA    RR     B      0      1        1
  6:  RR  AA    RR     B      2      1        1
  7:  AA  AA    AA     B      2      1        1
  8:  AA  AA    RA     C      1      1        0
  9:  AA  AA    RA     C      1      1        0
 10:  AA  RR    RA     C      1      1        0
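
To see what myf() does on its own, here is a small walk-through on a plain
logical vector, outside data.table. The input vector s is made up for
illustration; the function body is the same as above and is repeated only so
the snippet runs by itself.

```r
# Same logic as myf() above, restated so this snippet is self-contained.
myf <- function( s ) {
  seg <- rep( 0, length( s ) )
  rs <- rle( s )
  span <- rs$lengths[ rs$values ]   # keep only the lengths of the TRUE runs
  seg[ s ] <- rep( seq_along( span ), times = span )
  seg
}

# Illustrative input: three TRUE runs separated by FALSE runs.
s <- c( TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, TRUE, TRUE )

rs <- rle( s )
rs$lengths   # 2 1 1 2 3
rs$values    # TRUE FALSE TRUE FALSE TRUE

myf( s )     # 1 1 0 2 0 0 3 3 3
```

Runs of TRUE are numbered in order of appearance and FALSE positions stay 0,
which is why row 5 of dadseg above is 0 and rows 6 and 7 begin segment 2
within Group B.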


On Fri, 2 Jan 2015, Jeff Newmiller wrote:

> The problem is that I cannot see how your use of rle and/or seq_along 
> could possibly lead to the sample result you are giving us. That is why 
> I asked for a new example.
> ---------------------------------------------------------------------------
> Jeff Newmiller                        The     .....       .....  Go Live...
> DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
>                                      Live:   OO#.. Dead: OO#..  Playing
> Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
> /Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
> --------------------------------------------------------------------------- 
> Sent from my phone. Please excuse my brevity.
>
> On January 2, 2015 5:11:09 PM PST, Beejai <kate.ignatius at gmail.com> wrote:
>> Obviously this is why I need help...
>>
>> This is a larger data frame.  I'm only posting something small here to
>> make it simple.  There are many Groups which are larger, and I want to
>> assign a sequence value to consecutive rows where sumchild is not
>> equal to 0.  As the data frame I'm working with is much larger, this
>> goes up to 100, maybe even 200, and I have many different groups (20K+).
>> I would like to do this for every group, not for the whole data frame.
>>
>> There is no particular science behind this, only data organizing.
>>
>> So just say we had data like so:
>>
>>     Dad Mum Child Group sumdad summum sumchild childseg
>>  1:  AA  RR    RA     A      2      2        0        0
>>  2:  AA  RR    RR     A      2      2        1        1
>>  3:  AA  AA    AA     B      4      5        5        1
>>  4:  AA  AA    RA     B      4      5        5        0
>>  5:  RA  AA    RR     B      0      5        5        2
>>  6:  RR  AA    RR     B      4      5        5        2
>>  7:  AA  AA    AA     B      4      5        5        2
>>  8:  AA  AA    AA     C      3      3        0        1
>>  9:  AA  AA    RA     C      3      3        0        0
>> 10:  AA  RR    RR     C      3      3        0        2
>> 11:  AA  RR    RA     C      2      2        0        0
>> 12:  AA  RR    RR     C      2      2        1        3
>> 13:  AA  AA    AA     C      4      5        5        3
>> 14:  AA  AA    RA     C      4      5        5        0
>> 15:  RA  AA    RR     C      0      5        5        4
>>
>> On Fri, Jan 2, 2015 at 12:29 PM, David Winsemius [via R]
>> <ml-node+s789695n4701316h51 at n4.nabble.com> wrote:
>>>
>>> On Jan 2, 2015, at 12:07 AM, Kate Ignatius wrote:
>>>
>>>> Ah, crap.  Yep you're right.  This is not going too well. Okay - let
>>>> me try that again:
>>>>
>>>> x$childseg<-0
>>>> x<-x$sumchild !=0
>>>
>>> That previous line would appear to overwrite the entire dataframe with the
>>> value of one vector
>>>
>>>> span<-rle(x)$lengths[rle(x)$values==TRUE]
>>>> x$childseg[x]<-rep(seq_along(span), times = span)
>>>>
>>>> Does this one have any errors?
>>> Even assuming that the code from Jeff Newmiller is creating those objects I
>>> get
>>>
>>>> x$childseg[x]<-rep(seq_along(span), times = span)
>>> Error in `*tmp*`$childseg : $ operator is invalid for atomic vectors
>>>
>>> In the last line you are indexing a vector with a dataframe (or perhaps a
>>> data.table).
>>>
>>> If we use Newmiller's object and then change some of the instances of "x" in
>>> your code to DT we get:
>>>
>>>> DT$childseg<-0
>>>> x<-DT$sumchild !=0  # Try not to overwrite your data-objects
>>>> span<-rle(x)$lengths[rle(x)$values==TRUE]
>>>> DT$childseg[x]<-rep(seq_along(span), times = span)
>>>> DT
>>>     Dad Mum Child Group sumdad summum sumchild childseg
>>>  1:  AA  RR    RA     A      2      2        0        0
>>>  2:  AA  RR    RR     A      2      2        1        1
>>>  3:  AA  AA    AA     B      4      5        5        1
>>>  4:  AA  AA    AA     B      4      5        5        1
>>>  5:  RA  AA    RR     B      0      5        5        1
>>>  6:  RR  AA    RR     B      4      5        5        1
>>>  7:  AA  AA    AA     B      4      5        5        1
>>>  8:  AA  AA    RA     C      3      3        0        0
>>>  9:  AA  AA    RA     C      3      3        0        0
>>> 10:  AA  RR    RA     C      3      3        0        0
>>>
>>> You persist in posting code where you do not explain what you are trying to
>>> do with it. You have already been told that your earlier efforts using `rle`
>>> did not make any sense. Post a complete example and then explain what you
>>> desire as an object. It's often helpful to provide a scientific background
>>> for what the data represents.
>>>
>>> --
>>> David.
>>>
>>>>
>>>>
>>>> On Fri, Jan 2, 2015 at 2:32 AM, David Winsemius <[hidden email]> wrote:
>>>>>
>>>>>> On Jan 1, 2015, at 5:07 PM, Kate Ignatius <[hidden email]> wrote:
>>>>>>
>>>>>> Apologies - mix up of syntax all over the place, a habit of mine.  The
>>>>>> last line was in there because of code beforehand so it really doesn't
>>>>>> need to be there.  Here is the proper code I hope:
>>>>>>
>>>>>> childseg<-0
>>>>>> x<-sumchild ==0
>>>>>> span<-rle(x)$lengths[rle(x)$values==TRUE]
>>>>>> childseg[x]<-rep(seq_along(span), times = span)
>>>>>>
>>>>>
>>>>> This remains not reproducible. We have no idea what sumchild might be and
>>>>> the code throws an error. My guess is that you are trying to get a result
>>>>> such as would be delivered by:
>>>>>
>>>>> childseg <- sumchild[ sumchild != 0 ]
>>>>>
>>>>> ?
>>>>> David.
>>>>>
>>>>>>
>>>>>> On Thu, Jan 1, 2015 at 12:13 PM, Jeff Newmiller
>>>>>> <[hidden email]> wrote:
>>>>>>> Thank you for attempting to encode what you want using R syntax, but
>>>>>>> you are not really succeeding yet (too many errors). Perhaps another
>>>>>>> hand-generated result would help? A new input data frame might or might
>>>>>>> not be needed to illustrate desired results.
>>>>>>>
>>>>>>> Your second and third lines are syntactically incorrect, and I don't
>>>>>>> understand what you hope to accomplish by assigning an empty string to a
>>>>>>> numeric in your last line.
>>>>>>>
>>>>>>> On January 1, 2015 4:16:52 AM PST, Kate Ignatius <[hidden email]>
>>>>>>> wrote:
>>>>>>>> Is it possible to add the following code or similar in data.table:
>>>>>>>>
>>>>>>>> childseg<-0
>>>>>>>> x:=sumchild <-0
>>>>>>>> span<-rle(x)$lengths[rle(x)$values==TRUE
>>>>>>>> childseg[x]<-rep(seq_along(span), times = span)
>>>>>>>> childseg[childseg == 0]<-''
>>>>>>>>
>>>>>>>> I was hoping to do this code by Group for mum, dad and
>>>>>>>> child.  The problem I'm having is with the
>>>>>>>> span<-rle(x)$lengths[rle(x)$values==TRUE line which I'm not sure can
>>>>>>>> be added to data.table.
>>>>>>>>
>>>>>>>> [Previous email had incorrect code]
>>>>>>>>
>>>>>>>> On Wed, Dec 31, 2014 at 3:45 AM, Jeff Newmiller
>>>>>>>> <[hidden email]> wrote:
>>>>>>>>> I do not understand the value of using the rle function in your
>>>>>>>>> description, but the code below appears to produce the table you want.
>>>>>>>>>
>>>>>>>>> Note that better support for the data.table package might be found at
>>>>>>>>> stackexchange as the documentation specifies.
>>>>>>>>>
>>>>>>>>> x <- read.table( text=
>>>>>>>>> "Dad Mum Child Group
>>>>>>>>> AA RR RA A
>>>>>>>>> AA RR RR A
>>>>>>>>> AA AA AA B
>>>>>>>>> AA AA AA B
>>>>>>>>> RA AA RR B
>>>>>>>>> RR AA RR B
>>>>>>>>> AA AA AA B
>>>>>>>>> AA AA RA C
>>>>>>>>> AA AA RA C
>>>>>>>>> AA RR RA C
>>>>>>>>> ", header=TRUE, stringsAsFactors=FALSE )
>>>>>>>>>
>>>>>>>>> library(data.table)
>>>>>>>>> DT <- data.table( x )
>>>>>>>>> DT[ , cdad := as.integer( Dad %in% c( "AA", "RR" ) ) ]
>>>>>>>>> DT[ , sumdad := 0L ]
>>>>>>>>> DT[ 1==DT$cdad, sumdad := sum( cdad ), by=Group ]
>>>>>>>>> DT[ , cdad := NULL ]
>>>>>>>>> DT[ , cmum := as.integer( Mum %in% c( "AA", "RR" ) ) ]
>>>>>>>>> DT[ , summum := 0L ]
>>>>>>>>> DT[ 1==DT$cmum, summum := sum( cmum ), by=Group ]
>>>>>>>>> DT[ , cmum := NULL ]
>>>>>>>>> DT[ , cchild := as.integer( Child %in% c( "AA", "RR" ) ) ]
>>>>>>>>> DT[ , sumchild := 0L ]
>>>>>>>>> DT[ 1==DT$cchild, sumchild := sum( cchild ), by=Group ]
>>>>>>>>> DT[ , cchild := NULL ]
>>>>>>>>>
>>>>>>>>>> DT
>>>>>>>>>
>>>>>>>>>     Dad Mum Child Group sumdad summum sumchild
>>>>>>>>>  1:  AA  RR    RA     A      2      2        0
>>>>>>>>>  2:  AA  RR    RR     A      2      2        1
>>>>>>>>>  3:  AA  AA    AA     B      4      5        5
>>>>>>>>>  4:  AA  AA    AA     B      4      5        5
>>>>>>>>>  5:  RA  AA    RR     B      0      5        5
>>>>>>>>>  6:  RR  AA    RR     B      4      5        5
>>>>>>>>>  7:  AA  AA    AA     B      4      5        5
>>>>>>>>>  8:  AA  AA    RA     C      3      3        0
>>>>>>>>>  9:  AA  AA    RA     C      3      3        0
>>>>>>>>> 10:  AA  RR    RA     C      3      3        0
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, 30 Dec 2014, Kate Ignatius wrote:
>>>>>>>>>
>>>>>>>>>> I'm trying to use both these packages and wondering whether they are
>>>>>>>>>> possible...
>>>>>>>>>>
>>>>>>>>>> To make this simple, my ultimate goal is to determine long stretches
>>>>>>>>>> of 1s, but I want to do this within groups (hence using the data.table
>>>>>>>>>> as I use the "set key" option).  However, I'm not having much luck
>>>>>>>>>> making this possible.
>>>>>>>>>>
>>>>>>>>>> For example, for simplicity's sake, I have the following data:
>>>>>>>>>>
>>>>>>>>>> Dad Mum Child Group
>>>>>>>>>> AA RR RA A
>>>>>>>>>> AA RR RR A
>>>>>>>>>> AA AA AA B
>>>>>>>>>> AA AA AA B
>>>>>>>>>> RA AA RR B
>>>>>>>>>> RR AA RR B
>>>>>>>>>> AA AA AA B
>>>>>>>>>> AA AA RA C
>>>>>>>>>> AA AA RA C
>>>>>>>>>> AA RR RA C
>>>>>>>>>>
>>>>>>>>>> And the following code which I know works
>>>>>>>>>>
>>>>>>>>>> hetdad <- as.numeric(x[c(1)]=="AA" | x[c(1)]=="RR")
>>>>>>>>>> sumdad <- rle(hetdad)$lengths[rle(hetdad)$values==1]
>>>>>>>>>>
>>>>>>>>>> hetmum <- as.numeric(x[c(2)]=="AA" | x[c(2)]=="RR")
>>>>>>>>>> summum <- rle(hetmum)$lengths[rle(hetmum)$values==1]
>>>>>>>>>>
>>>>>>>>>> hetchild <- as.numeric(x[c(3)]=="AA" | x[c(3)]=="RR")
>>>>>>>>>> sumchild <- rle(hetchild)$lengths[rle(hetchild)$values==1]
>>>>>>>>>>
>>>>>>>>>> However, I wish to do the above code by Group (though this file is
>>>>>>>>>> millions of rows long and groups will be larger but just wanted to
>>>>>>>>>> simplify the example).
>>>>>>>>>>
>>>>>>>>>> I did something like this but of course I got an error:
>>>>>>>>>>
>>>>>>>>>> LOH[,hetdad:=as.numeric(x[c(1)]=="AA" | x[c(1)]=="RR")]
>>>>>>>>>> LOH[,sumdad:=rle(hetdad)$lengths[rle(hetdad)$values==1],by=Group]
>>>>>>>>>> LOH[,hetmum:=as.numeric(x[c(2)]=="AA" | x[c(2)]=="RR")]
>>>>>>>>>> LOH[,summum:=rle(hetmum)$lengths[rle(hetmum)$values==1],by=Group]
>>>>>>>>>> LOH[,hetchild:=as.numeric(x[c(3)]=="AA" | x[c(3)]=="RR")]
>>>>>>>>>> LOH[,sumchild:=rle(hetchild)$lengths[rle(hetchild)$values==1],by=Group]
>>>>>>>>>>
>>>>>>>>>> The reason being as I want to eventually have something like this:
>>>>>>>>>>
>>>>>>>>>> Dad Mum Child Group sumdad summum sumchild
>>>>>>>>>> AA RR RA A 2 2 0
>>>>>>>>>> AA RR RR A 2 2 1
>>>>>>>>>> AA AA AA B 4 5 5
>>>>>>>>>> AA AA AA B 4 5 5
>>>>>>>>>> RA AA RR B 0 5 5
>>>>>>>>>> RR AA RR B 4 5 5
>>>>>>>>>> AA AA AA B 4 5 5
>>>>>>>>>> AA AA RA C 3 3 0
>>>>>>>>>> AA AA RA C 3 3 0
>>>>>>>>>> AA RR RA C 3 3 0
>>>>>>>>>>
>>>>>>>>>> That is, I would like to have the specific counts next to what I'm
>>>>>>>>>> consecutively counting per group.  So for Group A for dad there are 2
>>>>>>>>>> AAs,  there are two RRs for mum but only 1 AA or RR for the child and
>>>>>>>>>> that is RR (so the 1 is next to the RR and not the RA).
>>>>>>>>>>
>>>>>>>>>> Can this be done?
>>>>>>>>>>
>>>>>>>>>> K.
>>>>>>>>>>
>>>>>>>>>> ______________________________________________
>>>>>>>>>> [hidden email] mailing list -- To UNSUBSCRIBE and more, see
>>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>>>>>> PLEASE do read the posting guide
>>>>>>>>>> http://www.R-project.org/posting-guide.html
>>>>>>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>
>>> David Winsemius
>>> Alameda, CA, USA
>>>

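For the per-group counts requested at the start of this thread (the sumdad,
summum and sumchild columns), a base-R sketch without data.table may be useful
for cross-checking. The helper name countf() is made up for this illustration
and is not part of any package; the data are the ten toy rows from the thread,
typed in by hand. As far as I can tell it agrees with the table quoted above.

```r
# Toy data copied from the example earlier in this thread.
Dad   <- c( "AA", "AA", "AA", "AA", "RA", "RR", "AA", "AA", "AA", "AA" )
Child <- c( "RA", "RR", "AA", "AA", "RR", "RR", "AA", "RA", "RA", "RA" )
Group <- c( "A", "A", "B", "B", "B", "B", "B", "C", "C", "C" )

# countf() is a hypothetical helper: within each group, count the AA/RR rows,
# then report that count on the AA/RR rows and 0 on all other rows.
countf <- function( v, g ) {
  hit <- v %in% c( "AA", "RR" )
  n <- ave( as.integer( hit ), g, FUN = sum )   # group total, recycled to each row
  ifelse( hit, n, 0L )
}

sumdad   <- countf( Dad, Group )
sumchild <- countf( Child, Group )

sumdad     # 2 2 4 4 0 4 4 3 3 3
sumchild   # 0 1 5 5 5 5 5 0 0 0
```

Note that this counts all AA/RR rows in a group rather than the length of each
separate run, which is what the requested table appears to show (Group B for
dad is 4 on both sides of the RA row).
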
---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
                                       Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k


