[R] rle with data.table - is it possible?
Jeff Newmiller
jdnewmil at dcn.davis.ca.us
Sat Jan 3 08:45:29 CET 2015
Here is what I get when I try to use your algorithm:
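library( data.table )   # needed for data.table() and :=

# Number each run of TRUE values in s (1, 2, ...); FALSE positions stay 0
myf <- function( s ) {
  seg <- rep( 0, length( s ) )
  rs <- rle( s )
  span <- rs$lengths[ rs$values ]   # lengths of the TRUE runs only
  seg[ s ] <- rep( seq_along( span ), times = span )
  seg
}

# x is the data frame read in with read.table() in my earlier message (quoted below)
DT <- data.table( x )
DT[ , dadseg := myf( Dad %in% c( "AA", "RR" ) ), by=Group ]
DT[ , mumseg := myf( Mum %in% c( "AA", "RR" ) ), by=Group ]
DT[ , childseg := myf( Child %in% c( "AA", "RR" ) ), by=Group ]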
> DT
    Dad Mum Child Group dadseg mumseg childseg
 1:  AA  RR    RA     A      1      1        0
 2:  AA  RR    RR     A      1      1        1
 3:  AA  AA    AA     B      1      1        1
 4:  AA  AA    AA     B      1      1        1
 5:  RA  AA    RR     B      0      1        1
 6:  RR  AA    RR     B      2      1        1
 7:  AA  AA    AA     B      2      1        1
 8:  AA  AA    RA     C      1      1        0
 9:  AA  AA    RA     C      1      1        0
10:  AA  RR    RA     C      1      1        0
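
To make the mechanics of myf easier to follow, here is a minimal sketch that runs the same steps on a single logical vector, using the Dad column of group B from the output above. The function is restated so the block runs on its own; the names `result` and `none` are only illustrative.

```r
# Same logic as myf above, restated so this block is self-contained
myf <- function( s ) {
  seg <- rep( 0, length( s ) )
  rs <- rle( s )
  span <- rs$lengths[ rs$values ]   # lengths of the TRUE runs only
  seg[ s ] <- rep( seq_along( span ), times = span )
  seg
}

# Dad column of group B: AA AA RA RR AA
s <- c( "AA", "AA", "RA", "RR", "AA" ) %in% c( "AA", "RR" )
rs <- rle( s )
rs$lengths             # 2 1 2
rs$values              # TRUE FALSE TRUE
result <- myf( s )
result                 # 1 1 0 2 2

# A vector with no TRUE values stays all zero (Child in group C)
none <- myf( c( FALSE, FALSE, FALSE ) )
none                   # 0 0 0
```

The run number restarts at 1 for each group only because the call is made with by=Group; the function itself knows nothing about groups.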
On Fri, 2 Jan 2015, Jeff Newmiller wrote:
> The problem is that I cannot see how your use of rle and/or seq_along
> could possibly lead to the sample result you are giving us. That is why
> I asked for a new example.
> ---------------------------------------------------------------------------
> Jeff Newmiller The ..... ..... Go Live...
> DCN:<jdnewmil at dcn.davis.ca.us> Basics: ##.#. ##.#. Live Go...
> Live: OO#.. Dead: OO#.. Playing
> Research Engineer (Solar/Batteries O.O#. #.O#. with
> /Software/Embedded Controllers) .OO#. .OO#. rocks...1k
> ---------------------------------------------------------------------------
> Sent from my phone. Please excuse my brevity.
>
> On January 2, 2015 5:11:09 PM PST, Beejai <kate.ignatius at gmail.com> wrote:
>> Obviously this is why I need help...
>>
>> This is a larger data frame. I'm only posting something small here to
>> make it simple. There are many Groups which are larger, and I want to
>> assign a sequence value to consecutive rows where sumchild is not
>> equal to 0. As the data frame I'm working with is much larger, this
>> goes up to 100 maybe even 200 and I have many different groups 20K+.
>> I would like to do this for every group, not for the whole data frame.
>>
>> There is no particular science behind this, only data organizing.
>>
>> So just say we had data like so:
>>
>>     Dad Mum Child Group sumdad summum sumchild childseg
>>  1:  AA  RR    RA     A      2      2        0        0
>>  2:  AA  RR    RR     A      2      2        1        1
>>  3:  AA  AA    AA     B      4      5        5        1
>>  4:  AA  AA    RA     B      4      5        5        0
>>  5:  RA  AA    RR     B      0      5        5        2
>>  6:  RR  AA    RR     B      4      5        5        2
>>  7:  AA  AA    AA     B      4      5        5        2
>>  8:  AA  AA    AA     C      3      3        0        1
>>  9:  AA  AA    RA     C      3      3        0        0
>> 10:  AA  RR    RR     C      3      3        0        2
>> 11:  AA  RR    RA     C      2      2        0        0
>> 12:  AA  RR    RR     C      2      2        1        3
>> 13:  AA  AA    AA     C      4      5        5        3
>> 14:  AA  AA    RA     C      4      5        5        0
>> 15:  RA  AA    RR     C      0      5        5        4
>>
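
The hand-built childseg column above appears to number the runs of rows where Child is AA or RR, restarting in each Group (rather than following sumchild != 0). Assuming that reading is the intended one, a minimal base R sketch that reproduces the column is shown here. The helper name `number_runs` is hypothetical, and `ave()` stands in for the grouping; it is not the data.table approach discussed in the thread.

```r
# Hypothetical helper: give each run of TRUE values its own index, 0 elsewhere
number_runs <- function( flag ) {
  rs <- rle( flag )
  # cumsum() counts the TRUE runs seen so far; FALSE runs are set to 0
  rs$values <- ifelse( rs$values, cumsum( rs$values ), 0L )
  inverse.rle( rs )
}

# Child and Group columns copied from the 15-row example above
Child <- c( "RA", "RR", "AA", "RA", "RR", "RR", "AA",
            "AA", "RA", "RR", "RA", "RR", "AA", "RA", "RR" )
Group <- c( "A", "A", "B", "B", "B", "B", "B",
            "C", "C", "C", "C", "C", "C", "C", "C" )

flag <- as.integer( Child %in% c( "AA", "RR" ) )
# ave() applies the helper separately within each Group
childseg <- ave( flag, Group, FUN = function( v ) number_runs( v == 1L ) )
childseg   # 0 1 1 0 2 2 2 1 0 2 0 3 3 0 4
```

With data.table the same helper could presumably be called as DT[ , childseg := number_runs( Child %in% c( "AA", "RR" ) ), by=Group ].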
>> On Fri, Jan 2, 2015 at 12:29 PM, David Winsemius [via R]
>> <ml-node+s789695n4701316h51 at n4.nabble.com> wrote:
>>>
>>> On Jan 2, 2015, at 12:07 AM, Kate Ignatius wrote:
>>>
>>>> Ah, crap. Yep you're right. This is not going too well. Okay - let
>>>> me try that again:
>>>>
>>>> x$childseg<-0
>>>> x<-x$sumchild !=0
>>>
>>> That previous line would appear to overwrite the entire dataframe with the
>>> value of one vector.
>>>
>>>> span<-rle(x)$lengths[rle(x)$values==TRUE]
>>>> x$childseg[x]<-rep(seq_along(span), times = span)
>>>>
>>>> Does this one have any errors?
>>> Even assuming that the code from Jeff Newmiller is creating those objects I
>>> get
>>>
>>> > x$childseg[x]<-rep(seq_along(span), times = span)
>>> Error in `*tmp*`$childseg : $ operator is invalid for atomic vectors
>>>
>>> In the last line you are indexing a vector with a dataframe (or perhaps a
>>> data.table).
>>>
>>> If we use Newmiller's object and then change some of the instances of "x" in
>>> your code to DT we get:
>>>
>>> > DT$childseg<-0
>>> > x<-DT$sumchild !=0 # Try not to overwrite your data-objects
>>> > span<-rle(x)$lengths[rle(x)$values==TRUE]
>>> > DT$childseg[x]<-rep(seq_along(span), times = span)
>>> > DT
>>>     Dad Mum Child Group sumdad summum sumchild childseg
>>>  1:  AA  RR    RA     A      2      2        0        0
>>>  2:  AA  RR    RR     A      2      2        1        1
>>>  3:  AA  AA    AA     B      4      5        5        1
>>>  4:  AA  AA    AA     B      4      5        5        1
>>>  5:  RA  AA    RR     B      0      5        5        1
>>>  6:  RR  AA    RR     B      4      5        5        1
>>>  7:  AA  AA    AA     B      4      5        5        1
>>>  8:  AA  AA    RA     C      3      3        0        0
>>>  9:  AA  AA    RA     C      3      3        0        0
>>> 10:  AA  RR    RA     C      3      3        0        0
>>>
>>> You persist in posting code where you do not explain what you are trying to
>>> do with it. You have already been told that your earlier efforts using `rle`
>>> did not make any sense. Post a complete example and then explain what you
>>> desire as an object. It's often helpful to provide a scientific background
>>> for what the data represents.
>>>
>>> --
>>> David.
>>>
>>>>
>>>>
>>>> On Fri, Jan 2, 2015 at 2:32 AM, David Winsemius <[hidden email]> wrote:
>>>>>
>>>>>> On Jan 1, 2015, at 5:07 PM, Kate Ignatius <[hidden email]> wrote:
>>>>>>
>>>>>> Apologies - mix up of syntax all over the place, a habit of mine. The
>>>>>> last line was in there because of code beforehand so it really doesn't
>>>>>> need to be there. Here is the proper code I hope:
>>>>>>
>>>>>> childseg<-0
>>>>>> x<-sumchild ==0
>>>>>> span<-rle(x)$lengths[rle(x)$values==TRUE]
>>>>>> childseg[x]<-rep(seq_along(span), times = span)
>>>>>>
>>>>>
>>>>> This remains not reproducible. We have no idea what sumchild might be and
>>>>> the code throws an error. My guess is that you are trying to get a result
>>>>> such as would be delivered by:
>>>>>
>>>>> childseg <- sumchild[ sumchild != 0 ]
>>>>>
>>>>> ?
>>>>> David.
>>>>>
>>>>>>
>>>>>> On Thu, Jan 1, 2015 at 12:13 PM, Jeff Newmiller
>>>>>> <[hidden email]> wrote:
>>>>>>> Thank you for attempting to encode what you want using R syntax, but
>>>>>>> you are not really succeeding yet (too many errors). Perhaps another hand
>>>>>>> generated result would help? A new input data frame might or might not be
>>>>>>> needed to illustrate desired results.
>>>>>>>
>>>>>>> Your second and third lines are syntactically incorrect, and I don't
>>>>>>> understand what you hope to accomplish by assigning an empty string to a
>>>>>>> numeric in your last line.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On January 1, 2015 4:16:52 AM PST, Kate Ignatius <[hidden email]> wrote:
>>>>>>>> Is it possible to add the following code or similar in data.table:
>>>>>>>>
>>>>>>>> childseg<-0
>>>>>>>> x:=sumchild <-0
>>>>>>>> span<-rle(x)$lengths[rle(x)$values==TRUE
>>>>>>>> childseg[x]<-rep(seq_along(span), times = span)
>>>>>>>> childseg[childseg == 0]<-''
>>>>>>>>
>>>>>>>> I was hoping to do this code by Group for mum, dad and
>>>>>>>> child. The problem I'm having is with the
>>>>>>>> span<-rle(x)$lengths[rle(x)$values==TRUE line which I'm not sure can
>>>>>>>> be added to data.table.
>>>>>>>>
>>>>>>>> [Previous email had incorrect code]
>>>>>>>>
>>>>>>>> On Wed, Dec 31, 2014 at 3:45 AM, Jeff Newmiller
>>>>>>>> <[hidden email]> wrote:
>>>>>>>>> I do not understand the value of using the rle function in your description,
>>>>>>>>> but the code below appears to produce the table you want.
>>>>>>>>>
>>>>>>>>> Note that better support for the data.table package might be found at
>>>>>>>>> stackexchange as the documentation specifies.
>>>>>>>>>
>>>>>>>>> x <- read.table( text=
>>>>>>>>> "Dad Mum Child Group
>>>>>>>>> AA RR RA A
>>>>>>>>> AA RR RR A
>>>>>>>>> AA AA AA B
>>>>>>>>> AA AA AA B
>>>>>>>>> RA AA RR B
>>>>>>>>> RR AA RR B
>>>>>>>>> AA AA AA B
>>>>>>>>> AA AA RA C
>>>>>>>>> AA AA RA C
>>>>>>>>> AA RR RA C
>>>>>>>>> ", header=TRUE, stringsAsFactors=FALSE )
>>>>>>>>>
>>>>>>>>> library(data.table)
>>>>>>>>> DT <- data.table( x )
>>>>>>>>> DT[ , cdad := as.integer( Dad %in% c( "AA", "RR" ) ) ]
>>>>>>>>> DT[ , sumdad := 0L ]
>>>>>>>>> DT[ 1==DT$cdad, sumdad := sum( cdad ), by=Group ]
>>>>>>>>> DT[ , cdad := NULL ]
>>>>>>>>> DT[ , cmum := as.integer( Mum %in% c( "AA", "RR" ) ) ]
>>>>>>>>> DT[ , summum := 0L ]
>>>>>>>>> DT[ 1==DT$cmum, summum := sum( cmum ), by=Group ]
>>>>>>>>> DT[ , cmum := NULL ]
>>>>>>>>> DT[ , cchild := as.integer( Child %in% c( "AA", "RR" ) ) ]
>>>>>>>>> DT[ , sumchild := 0L ]
>>>>>>>>> DT[ 1==DT$cchild, sumchild := sum( cchild ), by=Group ]
>>>>>>>>> DT[ , cchild := NULL ]
>>>>>>>>>
>>>>>>>>> > DT
>>>>>>>>>
>>>>>>>>>     Dad Mum Child Group sumdad summum sumchild
>>>>>>>>>  1:  AA  RR    RA     A      2      2        0
>>>>>>>>>  2:  AA  RR    RR     A      2      2        1
>>>>>>>>>  3:  AA  AA    AA     B      4      5        5
>>>>>>>>>  4:  AA  AA    AA     B      4      5        5
>>>>>>>>>  5:  RA  AA    RR     B      0      5        5
>>>>>>>>>  6:  RR  AA    RR     B      4      5        5
>>>>>>>>>  7:  AA  AA    AA     B      4      5        5
>>>>>>>>>  8:  AA  AA    RA     C      3      3        0
>>>>>>>>>  9:  AA  AA    RA     C      3      3        0
>>>>>>>>> 10:  AA  RR    RA     C      3      3        0
>>>>>>>>>
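
The three sum columns above follow one pattern: count the AA/RR rows per Group and show that count only on the AA/RR rows. As a cross-check that does not depend on data.table, a minimal base R sketch of that pattern follows. The helper name `count_hom` is hypothetical, and `ave()` replaces the by=Group step used in the code above.

```r
x <- read.table( text =
"Dad Mum Child Group
AA RR RA A
AA RR RR A
AA AA AA B
AA AA AA B
RA AA RR B
RR AA RR B
AA AA AA B
AA AA RA C
AA AA RA C
AA RR RA C
", header=TRUE, stringsAsFactors=FALSE )

# Hypothetical helper: per-group count of AA/RR rows, shown only on those rows
count_hom <- function( geno, group ) {
  flag <- as.integer( geno %in% c( "AA", "RR" ) )
  # ave() repeats the group total on every row; multiplying by flag zeroes the rest
  ave( flag, group, FUN = sum ) * flag
}

x$sumdad   <- count_hom( x$Dad,   x$Group )
x$summum   <- count_hom( x$Mum,   x$Group )
x$sumchild <- count_hom( x$Child, x$Group )
x
```

Note that these are per-group totals, not run lengths: Dad in group B gets 4 even though the four matching rows are split into two runs of 2.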
>>>>>>>>>
>>>>>>>>> On Tue, 30 Dec 2014, Kate Ignatius wrote:
>>>>>>>>>
>>>>>>>>>> I'm trying to use both these packages and wondering whether they are
>>>>>>>>>> possible...
>>>>>>>>>>
>>>>>>>>>> To make this simple, my ultimate goal is to determine long stretches of
>>>>>>>>>> 1s, but I want to do this within groups (hence using data.table, as
>>>>>>>>>> I use the "set key" option). However, I'm not having much luck
>>>>>>>>>> making this possible.
>>>>>>>>>>
>>>>>>>>>> For example, for simplicity's sake, I have the following data:
>>>>>>>>>>
>>>>>>>>>> Dad Mum Child Group
>>>>>>>>>> AA RR RA A
>>>>>>>>>> AA RR RR A
>>>>>>>>>> AA AA AA B
>>>>>>>>>> AA AA AA B
>>>>>>>>>> RA AA RR B
>>>>>>>>>> RR AA RR B
>>>>>>>>>> AA AA AA B
>>>>>>>>>> AA AA RA C
>>>>>>>>>> AA AA RA C
>>>>>>>>>> AA RR RA C
>>>>>>>>>>
>>>>>>>>>> And the following code which I know works
>>>>>>>>>>
>>>>>>>>>> hetdad <- as.numeric(x[c(1)]=="AA" | x[c(1)]=="RR")
>>>>>>>>>> sumdad <- rle(hetdad)$lengths[rle(hetdad)$values==1]
>>>>>>>>>>
>>>>>>>>>> hetmum <- as.numeric(x[c(2)]=="AA" | x[c(2)]=="RR")
>>>>>>>>>> summum <- rle(hetmum)$lengths[rle(hetmum)$values==1]
>>>>>>>>>>
>>>>>>>>>> hetchild <- as.numeric(x[c(3)]=="AA" | x[c(3)]=="RR")
>>>>>>>>>> sumchild <- rle(hetchild)$lengths[rle(hetchild)$values==1]
>>>>>>>>>>
>>>>>>>>>> However, I wish to do the above code by Group (though this file is
>>>>>>>>>> millions of rows long and groups will be larger but just wanted to
>>>>>>>>>> simplify the example).
>>>>>>>>>>
>>>>>>>>>> I did something like this but of course I got an error:
>>>>>>>>>>
>>>>>>>>>> LOH[,hetdad:=as.numeric(x[c(1)]=="AA" | x[c(1)]=="RR")]
>>>>>>>>>> LOH[,sumdad:=rle(hetdad)$lengths[rle(hetdad)$values==1],by=Group]
>>>>>>>>>> LOH[,hetmum:=as.numeric(x[c(2)]=="AA" | x[c(2)]=="RR")]
>>>>>>>>>> LOH[,summum:=rle(hetmum)$lengths[rle(hetmum)$values==1],by=Group]
>>>>>>>>>> LOH[,hetchild:=as.numeric(x[c(3)]=="AA" | x[c(3)]=="RR")]
>>>>>>>>>> LOH[,sumchild:=rle(hetchild)$lengths[rle(hetchild)$values==1],by=Group]
>>>>>>>>>>
>>>>>>>>>> The reason being that I want to eventually have something like this:
>>>>>>>>>>
>>>>>>>>>> Dad Mum Child Group sumdad summum sumchild
>>>>>>>>>> AA RR RA A 2 2 0
>>>>>>>>>> AA RR RR A 2 2 1
>>>>>>>>>> AA AA AA B 4 5 5
>>>>>>>>>> AA AA AA B 4 5 5
>>>>>>>>>> RA AA RR B 0 5 5
>>>>>>>>>> RR AA RR B 4 5 5
>>>>>>>>>> AA AA AA B 4 5 5
>>>>>>>>>> AA AA RA C 3 3 0
>>>>>>>>>> AA AA RA C 3 3 0
>>>>>>>>>> AA RR RA C 3 3 0
>>>>>>>>>>
>>>>>>>>>> That is, I would like to have the specific counts next to what I'm
>>>>>>>>>> consecutively counting per group. So for Group A for dad there are 2
>>>>>>>>>> AAs, there are two RRs for mum but only 1 AA or RR for the child and
>>>>>>>>>> that is RR (so the 1 is next to the RR and not the RA).
>>>>>>>>>>
>>>>>>>>>> Can this be done?
>>>>>>>>>>
>>>>>>>>>> K.
>>>>>>>>>>
>>>>>>>>>> ______________________________________________
>>>>>>>>>> [hidden email] mailing list -- To UNSUBSCRIBE and more, see
>>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>>>>>> PLEASE do read the posting guide
>>>>>>>>>> http://www.R-project.org/posting-guide.html
>>>>>>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>
>>> David Winsemius
>>> Alameda, CA, USA
>>>
---------------------------------------------------------------------------
Jeff Newmiller The ..... ..... Go Live...
DCN:<jdnewmil at dcn.davis.ca.us> Basics: ##.#. ##.#. Live Go...
Live: OO#.. Dead: OO#.. Playing
Research Engineer (Solar/Batteries O.O#. #.O#. with
/Software/Embedded Controllers) .OO#. .OO#. rocks...1k