[R] what is the faster way to search for a pattern in a few million entries data frame ?

Fabien Tarrade fabien.tarrade at gmail.com
Fri Apr 29 21:03:02 CEST 2016


Hi Martin and everybody,

Sorry for the long delay, and thanks for all the suggestions. With my code
and my training data I found timings similar to the ones below.

Thanks

Cheers

Fabien

> I did this to generate and search 40 million unique strings
>
> > grams <- as.character(1:4e7)        ## a long time passes...
> > system.time(grep("^900001", grams)) ## similar times to grepl
>    user  system elapsed
>  10.384   0.168  10.543
>
> Is that the basic task you're trying to accomplish? grep(l) goes 
> quickly to C, so I don't think data.table or other will be markedly 
> faster if you're looking for an arbitrary regular expression (use 
> fixed=TRUE if looking for an exact match).
>
> If you're looking for strings that start with a pattern, then in 
> R-3.3.0 there is
>
> > system.time(res0 <- startsWith(grams, "900001"))
>    user  system elapsed
>   0.658   0.012   0.669
>
> which returns the same result as grepl
>
> > identical(res0, res1 <- grepl("^900001", grams))
> [1] TRUE
>
> One can also parallelize the already vectorized grepl function with 
> parallel::pvec, with some opportunity for gain (compared to grepl) on 
> non-Windows
>
> > system.time(res2 <- parallel::pvec(seq_along(grams), function(i)
> +     grepl("^900001", grams[i]), mc.cores=8))
>    user  system elapsed
>  24.996   1.709   3.974
> > identical(res0, res2)
> [1] TRUE
>
> I think anything else would require pre-processing of some kind, and 
> then some more detail about what your data looks like is required.
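
For anyone comparing the options in the quoted reply, here is a minimal
sketch on a much smaller vector (1e5 strings instead of 4e7, and the
shorter prefix "9000", both chosen only so that it runs quickly). It is
meant to show how the calls differ in what they match, not to reproduce
the timings above. Note that fixed=TRUE matches the literal text anywhere
in the string, so it is not a drop-in replacement for an anchored pattern.

```r
# Smaller vector than in the thread so the sketch runs quickly
# (an assumption for illustration, not the original data).
grams <- as.character(1:1e5)

# Regular expression anchored at the start of each string.
res_regex <- grepl("^9000", grams)

# Prefix test without the regex engine (needs R >= 3.3.0).
res_prefix <- startsWith(grams, "9000")

# fixed = TRUE treats the pattern as a literal substring, so it can
# match anywhere in the string, not only at the start.
res_fixed <- grepl("9000", grams, fixed = TRUE)

# Whole-string equality needs no pattern matching at all.
res_equal <- grams == "9000"

# Compare the results of the different approaches.
identical(res_regex, res_prefix)
c(prefix = sum(res_prefix), fixed = sum(res_fixed), equal = sum(res_equal))
```

Wrapping any of these calls in system.time() on the real data should show
which one is fast enough for the task at hand.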

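As for the pre-processing mentioned at the end of the quoted reply, one
possible form is sketched below. It assumes that every query is a prefix
of one fixed, known length (4 characters here), which may not hold for the
real data. The names k and prefix_index are made up for this example. The
index is built once, and each later query is a list lookup instead of a
scan over all strings.

```r
# Small example vector (an assumption, standing in for the real data).
grams <- as.character(1:1e5)

# Hypothetical fixed prefix length shared by all queries.
k <- 4

# Build the index once: for each distinct k-character prefix, store the
# positions of the strings that begin with it.
prefix_index <- split(seq_along(grams), substr(grams, 1, k))

# A query is now a lookup by name rather than a scan of the whole vector.
hits <- prefix_index[["9000"]]

# Same positions as a full scan with startsWith().
identical(hits, which(startsWith(grams, "9000")))
grams[hits]
```

Building the index costs one pass over the data plus memory for the list,
so it only pays off when many queries are run against the same strings.
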
-- 
Dr Fabien Tarrade

Quantitative Analyst/Developer - Data Scientist

Senior data analyst specialised in the modelling, processing and 
statistical treatment of data.
PhD in Physics, 10 years of experience as researcher at the forefront of 
international scientific research.
Fascinated by finance and data modelling.

Geneva, Switzerland

Email : contact at fabien-tarrade.eu <mailto:contact at fabien-tarrade.eu>
Web   : www.fabien-tarrade.eu <http://www.fabien-tarrade.eu>
Phone : +33 (0)6 14 78 70 90

LinkedIn <http://ch.linkedin.com/in/fabientarrade/>
Twitter <https://twitter.com/fabtar>
Google <https://plus.google.com/+FabienTarradeProfile/posts>
Facebook <https://www.facebook.com/fabien.tarrade.eu>
Skype <skype:fabtarhiggs?call>
Xing <https://www.xing.com/profile/Fabien_Tarrade>


