[R] Finding unique elements faster

Gheorghe Postelnicu gheorghe.postelnicu at gmail.com
Mon Dec 8 21:57:37 CET 2014

2 ideas (haven't tried them):

1. If your data is in a data frame, have you tried the by() function? It
would do the grouping for you.

2. Since you mention the CPU cores, you could use the foreach package with
%dopar%, or mclapply from the parallel package.
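For idea 2, a minimal sketch using mclapply from the parallel package; the
word1/word2 vectors below are invented toy data standing in for yours, and
mc.cores > 1 is not supported on Windows (use 1 there):

```r
## Hypothetical toy data standing in for the poster's word1/word2 vectors
word1 <- c("the", "the", "the", "a", "a")
word2 <- c("cat", "dog", "cat", "cat", "cat")

## Split the indices by word1, then count the distinct word2 values per
## group across several cores (mc.cores is ignored on Windows; use 1 there)
library(parallel)
groups <- split(seq_along(word1), word1)
counts <- mclapply(groups, function(idx) length(unique(word2[idx])),
                   mc.cores = 2)

## Map each group's count back onto every position, as the original loop did
typefreq.after1 <- unlist(counts)[word1]
```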

I would try 1. and see if it provides a sufficient speed-up.
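As a sketch of idea 1, here is the same grouping done with tapply (a close
relative of by); the word1/word2 vectors are again invented toy data:

```r
## Hypothetical toy data standing in for the poster's vectors
word1 <- c("the", "the", "the", "a", "a")
word2 <- c("cat", "dog", "cat", "cat", "cat")

## One grouped pass instead of ~400,000 subset-and-unique calls:
## count the distinct word2 values within each word1 group ...
types.per.group <- tapply(word2, word1, function(x) length(unique(x)))

## ... then broadcast the per-group counts back onto every position,
## which is what the original loop computed element by element
typefreq.after1 <- as.vector(types.per.group[word1])
```

This replaces the O(n^2) pattern (a full scan of word1 for each i) with a
single split-and-count, so it should scale far better on 400,000 pairs.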

On Mon, Dec 8, 2014 at 9:21 PM, apeshifter <ch_koch at gmx.de> wrote:

> Dear all,
> for the past two weeks, I've been working on a script to retrieve word
> pairs
> and calculate some of their statistics using R. Everything seemed to work
> fine until I switched from a small test dataset to the 'real thing' and
> noticed what a runtime monster I had devised!
> I could reduce processing time significantly when I realized that with R, I
> did not have to do everything in loops and count things vector element by
> vector element, but could just have the program count everything with
> tables, e.g. with
>   > freq.w1w2.2<-table(all.word.pairs)[all.word.pairs]
> However, now I seem to have run into a performance problem that I cannot
> solve. I hope there's a kind soul on this list who has some advice for me.
> On to the problem:
> The last relic of the afore-mentioned for-loop that goes through all the
> word pairs and tries to calculate some statistics on them is the following
> line of code:
>   > typefreq.after1[i]<-length(unique(word2[which(word1==word1[i])]))
> (where word1 and word2 are the first and second word within the two-word
> sequence all.word.pairs, above)
> Here, I am trying to count the number of 'types', linguistically speaking,
> before the second word in the two-word sequence (later, I am doing the same
> for the first word within the sequence). The expression works, but given my
> ~400,000 word pairs/word1's/word2's etc, this takes quite some time. About
> 10 hours on my machine, in fact, since R cannot use the other three of the
> four cores. Since I want to repeat the process for another 20 corpora of
> similar size, I would definitely appreciate some help on this subject.
> I have been trying 'typefreq.after1<-table(unique(word2[word1]))[2]' and
> the
> subset() function and both seem to work (though I haven't checked whether
> all the numbers are in fact correctly calculated), but they take about the
> same amount of time. So that's no use for me.
> Does anybody have any tips to speed this up?
> Thank you very much!
> --
> View this message in context:
> http://r.789695.n4.nabble.com/Finding-unique-elements-faster-tp4700539.html
> Sent from the R help mailing list archive at Nabble.com.