[R] what is the effective method to apply the below logic for ~1.2 million records in R
Ista Zahn
istazahn at gmail.com
Sun Sep 20 14:42:44 CEST 2015
Hi Ravi,
Did you try fixing the problem? What did you try and what went wrong?
The answer is probably
A <- as.data.table(A)
A[ , g15 := cumsum(ifelse(is.na(Time_Diff > 12), 0, Time_Diff > 12))]
A[ , flag_1 := 1:.N, by = c("customer", "g15")]
A[ , g15 := NULL]
but you would have learned more if you had at least tried getting
there yourself.
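
For comparison, here is a minimal base-R sketch of the same run-numbering idea (it is not the data.table code above). It assumes the rows are already sorted by customer. The sample values and the helper name new_run are made up for illustration, with two successive Time_Diff values > 12 included on purpose:

```r
# Made-up sample data; rows 4 and 5 are two successive Time_Diff values > 12.
A <- data.frame(customer  = c(1, 1, 1, 1, 1, 1, 2, 2, 2),
                Time_Diff = c(NA, 10, 8, 15, 20, 10, NA, 2, 5))

# A new run starts where Time_Diff is missing, where it exceeds 12,
# or where the customer changes (assumes rows are sorted by customer).
new_run <- is.na(A$Time_Diff) | A$Time_Diff > 12 |
  c(TRUE, A$customer[-1] != A$customer[-nrow(A)])

# cumsum() numbers the runs, tabulate() gives each run's length,
# and sequence() counts 1, 2, 3, ... within each run.
A$flag_1 <- sequence(tabulate(cumsum(new_run)))

print(A)
```

Each TRUE in new_run restarts the counter, so two successive values > 12 both get flag_1 = 1.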
Best,
Ista
On Sun, Sep 20, 2015 at 6:19 AM, Ravi Teja <raviteja2504 at gmail.com> wrote:
> Hi Ista.
>
> Thanks a ton for the response and your assumptions were right.
>
> If the Time_Diff is missing then flag_1 value should be 1
> if the Time_Diff is > 12 then flag_1 value should be 1
> if the Time_Diff is <= 12 then the flag_1 value for the current row i
> should be flag_1[i-1] + 1
>
> When I tried to apply the logic you had shared, the results are deviating
> from the expected results.
>
> I think the logic you had shared will not function if there are two
> successive rows with Time_Diff values > 12
>
> I have attached a sample of my original data set and the expected flag_1
> column to this mail.
>
> Please help in tweaking your code to generate the attached result.
>
> Awaiting your reply.
>
> Thanks,
> Ravi
>
> On Sun, Sep 20, 2015 at 8:18 AM, Ista Zahn <istazahn at gmail.com> wrote:
>>
>> This assumes that the data are sorted by customer, and that only the
>> first value of Time_Diff is missing for each customer (and that the
>> first value is always missing for each customer). If those assumptions
>> hold you can do something like
>>
>> A <- read.table(text = "customer Time_Diff flag_1
>> 1 NA 1
>> 1 10 2
>> 1 8 3
>> 1 15 1
>> 1 9 2
>> 1 10 3
>> 2 NA 1
>> 2 2 2
>> 2 5 3",
>> header = TRUE)
>>
>> A$flag_1 <- NULL
>>
>> library(data.table)
>>
>> A <- as.data.table(A)
>> A[ , g15 := cumsum(c(0, ifelse(is.na(diff(Time_Diff > 12)), 0,
>> diff(Time_Diff > 12) > 0)))]
>> ## I'm not proud of the previous line, probably there is a cleaner way
>> A[ , flag_1 := 1:.N, by = c("customer", "g15")]
>> A[ , g15 := NULL]
>>
>> Best,
>> Ista
>>
>> On Sat, Sep 19, 2015 at 5:09 PM, Ravi Teja <raviteja2504 at gmail.com> wrote:
>> > Hi,
>> >
>> > I am trying to apply the below logic to generate flag_1 column on a data
>> > set consisting of ~1.2 million records in R.
>> >
>> > Code :
>> >
>> > for(i in 1: nrows)
>> > {
>> >   if(A$customer[i]==A$customer[i+1])
>> >   {
>> >
>> >     if(is.na(A$Time_Diff[i]))
>> >       A$flag_1[i] <- 1
>> >     else if (A$Time_Diff[i] > 12)
>> >       A$flag_1[i] <- 1
>> >     else
>> >       A$flag_1[i] <- A$flag_1[i-1]+1
>> >
>> >   }
>> >
>> >   else
>> >   {
>> >
>> >     if(is.na(A$Time_Diff[i]))
>> >       A$flag_1[i] <- 1
>> >     else if (A$Time_Diff[i] > 12)
>> >       A$flag_1[i] <- 1
>> >     else
>> >       A$flag_1[i] <- A$flag_1[i-1]+1
>> >
>> >   }
>> > }
>> >
>> >
>> > Resultant dataset should look like
>> >
>> > customer Time_Diff flag_1
>> > 1 NA 1
>> > 1 10 2
>> > 1 8 3
>> > 1 15 1
>> > 1 9 2
>> > 1 10 3
>> > 2 NA 1
>> > 2 2 2
>> > 2 5 3
>> >
>> > The above logic will take approximately 60 hours to generate the flag_1
>> > column on a dataset consisting of ~1.2 million records. Is there a more
>> > efficient way to implement this logic in R?
>> >
>> > Appreciate your help.
>> >
>> > Thanks,
>> > Ravi
>> >
>
>
>
>
> --
> raviteja