[R] what is the faster way to search for a pattern in a few million entries data frame ?
Fabien Tarrade
fabien.tarrade at gmail.com
Fri Apr 29 21:03:02 CEST 2016
Hi Martin and everybody,
Sorry for the long delay, and thanks for all the suggestions. With my code
and my training data I get timings similar to the ones below.
Thanks
Cheers
Fabien
> I did this to generate and search 40 million unique strings
>
> > grams <- as.character(1:4e7) ## a long time passes...
> > system.time(grep("^900001", grams)) ## similar times to grepl
> user system elapsed
> 10.384 0.168 10.543
>
> Is that the basic task you're trying to accomplish? grep(l) goes
> quickly to C, so I don't think data.table or other will be markedly
> faster if you're looking for an arbitrary regular expression (use
> fixed=TRUE if looking for an exact match).
>
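To make the fixed=TRUE remark concrete, here is a minimal sketch on a much smaller stand-in vector (the size 2e5 and the pattern "900" are my own choices, not from the timings above). Note that fixed=TRUE treats the pattern as a literal substring, so it matches anywhere in the string; for whole-string equality a plain comparison is enough.

```r
# Small stand-in for the 40-million-element vector in the quoted timing.
grams <- as.character(1:2e5)

# Regular-expression search: "^" anchors the pattern at the start of the string.
hits_regex <- grep("^900", grams)

# Fixed search: the pattern is taken literally and may match anywhere,
# so this also finds strings such as "1900" or "19000".
hits_fixed <- grep("900", grams, fixed = TRUE)

# Whole-string equality does not need grep at all.
hits_equal <- which(grams == "900")
```

Every element of hits_regex is also in hits_fixed, but not the other way round, so the two calls are only interchangeable when a match anywhere in the string is acceptable.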
> If you're looking for strings that start with a pattern, then in
> R-3.3.0 there is
>
> > system.time(res0 <- startsWith(grams, "900001"))
> user system elapsed
> 0.658 0.012 0.669
>
> which returns the same result as grepl
>
> > identical(res0, res1 <- grepl("^900001", grams))
> [1] TRUE
>
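A small sketch of the same equivalence, with a substr() comparison as a possible fallback for R versions before 3.3.0 where startsWith() is not available. The vector size and the prefix are assumptions for illustration, and pasting "^" in front of the prefix is only safe when the prefix contains no regex metacharacters.

```r
grams  <- as.character(1:2e5)
prefix <- "9000"

# Anchored regular expression (prefix assumed free of regex metacharacters).
res_regex <- grepl(paste0("^", prefix), grams)

# Plain comparison of the leading characters; works on older R versions too.
res_substr <- substr(grams, 1, nchar(prefix)) == prefix

# Use startsWith() where it exists (R >= 3.3.0), otherwise the substr() result.
res_starts <- if (exists("startsWith", mode = "function")) {
  startsWith(grams, prefix)
} else {
  res_substr
}

all_agree <- identical(res_regex, res_substr) && identical(res_regex, res_starts)
```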
> One can also parallelize the already vectorized grepl function with
> parallel::pvec, with some opportunity for gain (compared to grepl) on
> non-Windows
>
> > system.time(res2 <- parallel::pvec(seq_along(grams), function(i)
>     grepl("^900001", grams[i]), mc.cores=8))
> user system elapsed
> 24.996 1.709 3.974
> > identical(res0, res2)
> [1] TRUE
>
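For anyone who wants to try the pvec approach, a reduced sketch follows. The vector size and the choice of two cores are my assumptions; on Windows forking is not available and pvec() only accepts mc.cores = 1, so the core count is guarded.

```r
library(parallel)

grams <- as.character(1:2e5)

# pvec() forks worker processes; on Windows it only runs with mc.cores = 1.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Serial reference result.
res_serial <- grepl("^9000", grams)

# pvec() splits the index vector into chunks, runs the function on each
# chunk in a separate process, and concatenates the results in order.
res_par <- pvec(seq_along(grams),
                function(i) grepl("^9000", grams[i]),
                mc.cores = n_cores)

same_result <- identical(res_serial, res_par)
```

On a vector this small the forking overhead will probably outweigh any gain; the point is only to show the call pattern.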
> I think anything else would require pre-processing of some kind, and
> then some more detail about what your data looks like is required.
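As one possible illustration of what "pre-processing" could mean here, the sketch below builds an index on the first two characters once and then searches only the matching bucket. This is my own assumption about a suitable scheme, not something proposed in the thread; the names key_len, index and find_prefix are hypothetical, and it only helps when many prefix searches are run against the same vector.

```r
grams <- as.character(1:2e5)

# Hypothetical index: element positions grouped by their first two characters.
key_len <- 2L
index <- split(seq_along(grams), substr(grams, 1, key_len))

# Illustrative helper: look only inside the bucket that shares the prefix's key,
# then compare the leading characters of the candidates with the full prefix.
find_prefix <- function(prefix) {
  stopifnot(nchar(prefix) >= key_len)
  bucket <- index[[substr(prefix, 1, key_len)]]
  if (is.null(bucket)) return(integer(0))
  bucket[substr(grams[bucket], 1, nchar(prefix)) == prefix]
}

hits <- find_prefix("9000")
```

The returned positions are the same as those from grep("^9000", grams), but each lookup touches only a fraction of the vector after the one-off cost of building the index.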
--
Dr Fabien Tarrade
Quantitative Analyst/Developer - Data Scientist
Senior data analyst specialised in the modelling, processing and
statistical treatment of data.
PhD in Physics, 10 years of experience as researcher at the forefront of
international scientific research.
Fascinated by finance and data modelling.
Geneva, Switzerland