[R] Finding unique elements faster

Mon Dec 8 21:21:36 CET 2014

Dear all, 

for the past two weeks, I've been working on a script to retrieve word pairs
and calculate some of their statistics using R. Everything seemed to work
fine until I switched from a small test dataset to the 'real thing' and
noticed what a runtime monster I had devised! 

I could reduce processing time significantly when I realized that with R, I
did not have to do everything in loops and count things vector element by
vector element, but could just have the program count everything with
tables, e.g. with 
  > freq.w1w2.2<-table(all.word.pairs)[all.word.pairs]
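
To illustrate what that line does, here is a minimal sketch on a toy vector (the pair strings are made up for illustration and are not from my corpus). It assumes all.word.pairs is a character vector, so that the table is indexed by name rather than by position:

```r
# Toy data; these pair strings are hypothetical, not from the real corpus.
all.word.pairs <- c("the cat", "the dog", "the cat", "a dog", "the cat")

# table() counts each distinct pair once; indexing the table by the
# original character vector looks up each element's count by name.
# (Assumption: all.word.pairs is character, not a factor.)
pair.counts <- table(all.word.pairs)
freq.w1w2.2 <- pair.counts[all.word.pairs]

# Drop the table attributes to get a plain vector of counts,
# aligned element by element with all.word.pairs.
freq.vec <- as.vector(freq.w1w2.2)
print(freq.vec)
```

The counting happens once, in table(); the indexing step is only a lookup, which is why this is so much faster than counting inside a loop.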

However, now I seem to have run into a performance problem that I cannot
solve. I hope there's a kind soul on this list who has some advice for me.
On to the problem:

The last relic of the aforementioned for-loop, which goes through all the
word pairs and calculates some statistics on them, is the following
line of code:
  > typefreq.after1[i]<-length(unique(word2[which(word1==word1[i])]))
(where word1 and word2 are the first and second word within the two-word
sequence; see all.word.pairs, above)
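
For concreteness, here is a minimal sketch of the loop around that line, run on toy vectors (the words are invented for illustration; the real vectors have about 400,000 elements):

```r
# Toy data; hypothetical words standing in for the real corpus.
word1 <- c("the", "the", "the", "a", "a")
word2 <- c("cat", "dog", "cat", "dog", "dog")

typefreq.after1 <- integer(length(word1))
for (i in seq_along(word1)) {
  # For every pair, rescan the whole of word1 to find all pairs sharing
  # this first word, then count the distinct second words among them.
  typefreq.after1[i] <- length(unique(word2[which(word1 == word1[i])]))
}
print(typefreq.after1)
```

Each iteration compares word1[i] against the entire vector, so the total work grows with the square of the number of pairs.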

Here, I am trying to count the number of 'types', linguistically speaking,
before the second word in the two-word sequence (later, I am doing the same
for the first word within the sequence). The expression works, but given my
~400,000 word pairs/word1's/word2's etc., this takes quite some time: about
10 hours on my machine, in fact, since R uses only one of the
four cores. Since I want to repeat the process for another 20 corpora of
similar size, I would definitely appreciate some help on this subject.

I have also tried 'typefreq.after1<-table(unique(word2[word1]))[2]' and the
subset() function. Both seem to work (though I haven't checked whether
all the numbers are in fact correctly calculated), but they take about the
same amount of time, so they are of no use to me.

Does anybody have any tips to speed this up?

Thank you very much!
