[R] Finding unique elements faster

Mon Dec 8 22:47:31 CET 2014

The data.table package might be of use to you, but lacking a reproducible 
example [1] I think I will leave figuring out just how to you.

Being on Nabble you may not be able to see the footer appended to every 
message on this MAILING LIST. For your benefit, here it is:

* R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
* https://stat.ethz.ch/mailman/listinfo/r-help
* PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
* and provide commented, minimal, self-contained, reproducible code.

[1] http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example

On Mon, 8 Dec 2014, apeshifter wrote:

> Dear all,
>
> for the past two weeks, I've been working on a script to retrieve word pairs
> and calculate some of their statistics using R. Everything seemed to work
> fine until I switched from a small test dataset to the 'real thing' and
> noticed what a runtime monster I had devised!
>
> I could reduce processing time significantly when I realized that with R, I
> did not have to do everything in loops and count things vector element by
> vector element, but could just have the program count everything with
> tables, e.g. with
>  > freq.w1w2.2<-table(all.word.pairs)[all.word.pairs]
>
> However, now I seem to have run into a performance problem that I cannot
> solve. I hope there's a kind soul on this list who has some advice for me.
> On to the problem:
>
> The last relic of the afore-mentioned for-loop that goes through all the
> word pairs and tries to calculate some statistics on them is the following
> line of code:
>  > typefreq.after1[i]<-length(unique(word2[which(word1==word1[i])]))
> (where word1 and word2 are the first and second word within the two-word
> sequence (all.word.pairs, above)
>
> Here, I am trying to count the number of 'types', linguistically speaking,
> before the second word in the two-word sequence (later, I am doing the same
> for the first word within the sequence). The expression works, but given my
> ~400,000 word pairs/word1's/word2's etc, this takes quite some time. About
> 10 hours on my machine, in fact, since R cannot use the other three of the
> four cores. Since I want to repeat the process for another 20 corpora of
> similar size, I would definitely appreciate some help on this subject.
>
> I have been trying 'typefreq.after1<-table(unique(word2[word1]))[2]' and the
> subset() function and both seem to work (though I haven't checked whether
> all the numbers are in fact correctly calculated), but they take about the
> same amount of time. So that's no use for me.
>
> Does anybody have any tips to speed this up?
>
> Thank you very much!
>
>
>
> --
> View this message in context: http://r.789695.n4.nabble.com/Finding-unique-elements-faster-tp4700539.html
> Sent from the R help mailing list archive at Nabble.com.
>
> ______________________________________________
>

---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
                                       Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k