[R] slow computation of functions over large datasets
David Winsemius
dwinsemius at comcast.net
Wed Aug 3 21:46:09 CEST 2011
On Aug 3, 2011, at 3:05 PM, Ken wrote:
> Sorry about the lack of code, but using David's example, would:
> tapply(itemPrice, INDEX=orderID, FUN=sum)
> work?
That doesn't do the cumulative sums or the assignment into a column of
the same data.frame. That's why I used ave: it returns one value per
row, so the sequence stays correct.
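
A minimal sketch of the difference, using the small example data from the original post. The name order_totals is only illustrative and is not from the thread:

```r
# Small example data from the original post
exampledata <- data.frame(
  orderID   = c(1, 1, 1, 2, 2, 3, 3, 3, 4),
  itemPrice = c(10, 17, 9, 12, 25, 10, 1, 9, 7)
)

# tapply() collapses the data to one total per order:
# 4 values for 9 rows, so it cannot be assigned back as a column
order_totals <- tapply(exampledata$itemPrice,
                       INDEX = exampledata$orderID, FUN = sum)

# ave() applies FUN within each orderID group and returns one value
# per row, in the original row order, so it can go straight back
# into the same data.frame as a running total
exampledata$orderAmount <- ave(exampledata$itemPrice,
                               exampledata$orderID, FUN = cumsum)

print(order_totals)
print(exampledata)
```

The last row of each order in orderAmount equals that order's entry in order_totals, so the running total also carries the per-order sum.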
--
David.
> -Ken Hutchison
>
> On Aug 3, 2554 BE, at 2:09 PM, David Winsemius
> <dwinsemius at comcast.net> wrote:
>
>>
>> On Aug 3, 2011, at 2:01 PM, Ken wrote:
>>
>>> Hello,
>>> Perhaps transpose the table with attach(as.data.frame(t(data))) and
>>> use the colSums() function with order ID as header.
>>> -Ken Hutchison
>>
>> Got any code? The OP offered a reproducible example, after all.
>>
>> --
>> David.
>>>
>>> On Aug 3, 2554 BE, at 1:12 PM, David Winsemius
>>> <dwinsemius at comcast.net> wrote:
>>>
>>>>
>>>> On Aug 3, 2011, at 12:20 PM, jim holtman wrote:
>>>>
>>>>> This takes about 2 secs for 1M rows:
>>>>>
>>>>> > n <- 1000000
>>>>> > exampledata <- data.frame(orderID = sample(floor(n / 5), n,
>>>>> +     replace = TRUE), itemPrice = rpois(n, 10))
>>>>> > require(data.table)
>>>>> > # convert to data.table
>>>>> > ed.dt <- data.table(exampledata)
>>>>> > system.time(result <- ed.dt[
>>>>> + , list(total = sum(itemPrice))
>>>>> + , by = orderID
>>>>> + ]
>>>>> + )
>>>>>    user  system elapsed
>>>>>    1.30    0.05    1.34
>>>>
>>>> Interesting. Impressive. I noted that the OP wanted a running
>>>> total, which is what cumsum would provide, and for some reason
>>>> creating that longer result is even faster on my machine than the
>>>> shorter result using sum.
>>>>
>>>> --
>>>> David.
>>>>> >
>>>>> > str(result)
>>>>> Classes ‘data.table’ and 'data.frame': 198708 obs. of 2 variables:
>>>>>  $ orderID: int 1 2 3 4 5 6 8 9 10 11 ...
>>>>>  $ total  : num 49 37 72 92 50 76 34 22 65 39 ...
>>>>> > head(result)
>>>>>      orderID total
>>>>> [1,]       1    49
>>>>> [2,]       2    37
>>>>> [3,]       3    72
>>>>> [4,]       4    92
>>>>> [5,]       5    50
>>>>> [6,]       6    76
>>>>> >
>>>>>
>>>>>
>>>>> On Wed, Aug 3, 2011 at 9:25 AM, Caroline Faisst
>>>>> <caroline.faisst at gmail.com> wrote:
>>>>>> Hello there,
>>>>>>
>>>>>>
>>>>>> I’m computing the total value of an order from the prices of the
>>>>>> order items using a “for” loop and the “ifelse” function. I do this
>>>>>> on a large data frame (close to 1 million rows). The computation is
>>>>>> painfully slow: only about 90 rows are calculated per minute.
>>>>>>
>>>>>>
>>>>>> The computation time for a given number of rows increases with the
>>>>>> size of the dataset; see the example with my function below:
>>>>>>
>>>>>>
>>>>>> # small dataset: function performs well
>>>>>>
>>>>>> exampledata <- data.frame(orderID = c(1,1,1,2,2,3,3,3,4),
>>>>>>                           itemPrice = c(10,17,9,12,25,10,1,9,7))
>>>>>>
>>>>>> exampledata[1,"orderAmount"] <- exampledata[1,"itemPrice"]
>>>>>>
>>>>>> system.time(for (i in 2:length(exampledata[,1])) {
>>>>>>   exampledata[i,"orderAmount"] <-
>>>>>>     ifelse(exampledata[i,"orderID"] == exampledata[i-1,"orderID"],
>>>>>>            exampledata[i-1,"orderAmount"] + exampledata[i,"itemPrice"],
>>>>>>            exampledata[i,"itemPrice"])
>>>>>> })
>>>>>>
>>>>>>
>>>>>> # large dataset: the very same computational task takes much longer
>>>>>>
>>>>>> exampledata2 <- data.frame(
>>>>>>   orderID = c(1,1,1,2,2,3,3,3,4,5:2000000),
>>>>>>   itemPrice = c(10,17,9,12,25,10,1,9,7,25:2000020))
>>>>>>
>>>>>> exampledata2[1,"orderAmount"] <- exampledata2[1,"itemPrice"]
>>>>>>
>>>>>> system.time(for (i in 2:9) {
>>>>>>   exampledata2[i,"orderAmount"] <-
>>>>>>     ifelse(exampledata2[i,"orderID"] == exampledata2[i-1,"orderID"],
>>>>>>            exampledata2[i-1,"orderAmount"] + exampledata2[i,"itemPrice"],
>>>>>>            exampledata2[i,"itemPrice"])
>>>>>> })
>>>>>>
>>>>>>
>>>>>>
>>>>>> Does someone know a way to increase the speed?
>>>>>>
>>>>>>
>>>>>> Thank you very much!
>>>>>>
>>>>>> Caroline
>>>>>>
>>>>>>
>>>>>>
>>>>>> ______________________________________________
>>>>>> R-help at r-project.org mailing list
>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>>>> and provide commented, minimal, self-contained, reproducible
>>>>>> code.
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Jim Holtman
>>>>> Data Munger Guru
>>>>>
>>>>> What is the problem that you are trying to solve?
>>>>>
>>>>
>>>> David Winsemius, MD
>>>> West Hartford, CT
David Winsemius, MD
West Hartford, CT