[R] Finding unique elements faster
Stefan Evert
stefanML at collocations.de
Mon Dec 8 23:16:23 CET 2014
On 8 Dec 2014, at 21:21, apeshifter <ch_koch at gmx.de> wrote:
> The last relic of the aforementioned for-loop that goes through all the
> word pairs and tries to calculate some statistics on them is the following
> line of code:
>> typefreq.after1[i]<-length(unique(word2[which(word1==word1[i])]))
> (where word1 and word2 are the first and second words within the two-word
> sequences (all.word.pairs, above))
It is difficult to tell without a fully reproducible example, but from this code I get the impression that word1 and word2 represent word pair _tokens_ rather than pair _types_ (otherwise you wouldn't need the unique()). That's a very inefficient way of dealing with co-occurrence data, especially since you've already computed the set of pair types in order to get the co-occurrence counts.
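To make that concrete, here is a tiny sketch with invented data (none of these values come from your script):

word1 <- c("new", "new", "new", "old", "old")    # pair tokens: one entry per occurrence
word2 <- c("york", "york", "year", "town", "town")
BB <- unique(data.frame(word1, word2))           # pair types: one row per distinct pair

The token vectors contain repeated pairs; BB lists each distinct pair just once.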
If word1 and word2 are type vectors (i.e. every pair occurs just once) and BB is the data frame that holds them, then this should give you what you want:
tapply(BB$word2, BB$word1, length)
If they are token vectors, you need to supply your own type-counting function, which will be a bit slower:
tapply(BB$word2, BB$word1, function (x) length(unique(x)))
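With the toy vectors from above, both routes agree:

tapply(word2, word1, function (x) length(unique(x)))
## new old
##   2   1

tapply(BB$word2, BB$word1, length)
## new old
##   2   1

i.e. "new" is followed by two distinct words (york, year) and "old" by one (town). Again, the toy data are invented purely for illustration.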
On my machine, this takes about 0.2s for 770,000 word pairs.
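In case you want to reproduce a timing of that order yourself, here is a rough sketch (vocabulary size and pair list are simulated, not your data):

word1 <- sample(sprintf("w%d", 1:20000), 770000, replace=TRUE)
word2 <- sample(sprintf("w%d", 1:20000), 770000, replace=TRUE)
system.time(tapply(word2, word1, function (x) length(unique(x))))

The exact numbers will of course depend on your machine and on how skewed the frequency distribution is.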
BTW, you might want to take a look at Unit 4 of the SIGIL course
http://sigil.r-forge.r-project.org/
which has some tips on how you can deal efficiently with co-occurrence data in R.
Best,
Stefan