[R] what is the fastest way to search for a pattern in a data frame with a few million entries?
Fabien Tarrade
fabien.tarrade at gmail.com
Sun Apr 10 20:03:23 CEST 2016
Hi there,
I have a data frame df with 40 million strings and their frequencies. I
am searching for strings that start with a given pattern, and I am trying
to speed up this part of my code. I have tried many options, but so far I
am not satisfied. I tried:
- grepl and subset are equivalent in terms of processing time:
grepl(paste0("^",pattern),df$Strings)
subset(df, grepl(paste0("^",pattern), df$Strings))
- lookup(pattern, df) is not what I am looking for, since it does exact
matching
- I tried converting my data frame to a data.table, but it didn't improve
things (though reading/writing the data.table will probably be much faster)
- the only thing that helped was removing the 1/3 of the data frame with
the lowest-frequency strings, which speeds up the process by a factor of 10!
- I haven't tried parRapply yet; on a multicore machine I could gain
another factor.
I did use parLapply for some other code, but I had many memory issues
(crashing my Mac).
I had to subdivide the dataset to get it working correctly, but I
didn't manage to fully understand the issue.
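
To make the subdivision idea above concrete, here is a minimal sketch of a chunked search. It is only an illustration under assumptions: chunked_prefix_match is a hypothetical helper, the chunk size is arbitrary, the pattern is treated as a literal prefix rather than a regular expression, and startsWith needs R >= 3.3.0. With cores = 1 it runs sequentially; values above 1 rely on forking through mclapply, which is only available on Unix-like systems such as OS X and Linux. Forked workers share the parent's memory copy-on-write, whereas a socket cluster (as used with parLapply) has to send the data to each worker, which may be related to the memory problems, although I have not verified that.

```r
library(parallel)

# Hypothetical helper: test a literal prefix chunk by chunk.
# chunk_size and cores are illustrative defaults, not tuned values.
chunked_prefix_match <- function(x, prefix, chunk_size = 1e6, cores = 1L) {
  # Assign each element to a consecutive chunk of at most chunk_size items.
  chunk_id <- ceiling(seq_along(x) / chunk_size)
  chunks <- split(x, chunk_id)
  # Each worker only handles one chunk at a time; cores = 1 is plain lapply.
  hits <- mclapply(chunks, startsWith, prefix, mc.cores = cores)
  # Chunks come back in their original order, so this lines up with x.
  unlist(hits, use.names = FALSE)
}

# Toy data standing in for the 40 million strings.
strings <- c("apple", "apricot", "banana", "apply", "band", "cherry")
hits <- chunked_prefix_match(strings, "ap", chunk_size = 2, cores = 1L)
strings[hits]
```

Note that split() itself makes a copy of the vector, so this trades some extra memory up front for smaller pieces of work per worker.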
I am sure there is some other smart way to do this. Are there any good
articles/blogs or suggestions that could give me some guidance?
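
For reference, here is a small sketch of the kind of search I mean, together with two directions that might be worth timing with system.time(). Everything in it is an assumption for illustration: the toy data, the column names Strings and Freq, the hypothetical helper prefix_lookup, and the premise that the pattern is a non-empty literal prefix with no regular expression metacharacters. startsWith requires R >= 3.3.0; the substr comparison works on older versions. The index groups row numbers by first character once, so that each later search only scans one bucket.

```r
# Toy stand-in for the real data frame (column names are assumptions).
df <- data.frame(
  Strings = c("apple", "apricot", "banana", "apply", "band", "cherry"),
  Freq    = c(10L, 3L, 7L, 1L, 2L, 5L),
  stringsAsFactors = FALSE
)
pattern <- "ap"

# Regex version from the post.
hits_regex <- grepl(paste0("^", pattern), df$Strings)

# Literal prefix test without the regex engine (base R >= 3.3.0).
hits_prefix <- startsWith(df$Strings, pattern)

# Same idea for older R versions.
hits_substr <- substr(df$Strings, 1L, nchar(pattern)) == pattern

# One-off index: row numbers grouped by the first character of each string.
first_char_index <- split(seq_len(nrow(df)), substr(df$Strings, 1L, 1L))

# Hypothetical helper: search only the bucket that shares the first character.
prefix_lookup <- function(df, index, prefix) {
  rows <- index[[substr(prefix, 1L, 1L)]]
  if (is.null(rows)) return(df[0L, , drop = FALSE])  # no such bucket
  df[rows[startsWith(df$Strings[rows], prefix)], , drop = FALSE]
}

result <- prefix_lookup(df, first_char_index, pattern)
```

Whether any of this beats grepl on 40 million rows would have to be measured; the index only pays off if it is built once and reused for many searches.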
Thanks a lot
Cheers
Fabien
--
Dr Fabien Tarrade
Quantitative Analyst/Developer - Data Scientist
Senior data analyst specialised in the modelling, processing and
statistical treatment of data.
PhD in Physics, 10 years of experience as researcher at the forefront of
international scientific research.
Fascinated by finance and data modelling.
Geneva, Switzerland
Email : contact at fabien-tarrade.eu
Website : www.fabien-tarrade.eu <http://www.fabien-tarrade.eu>
Phone : +33 (0)6 14 78 70 90
LinkedIn <http://ch.linkedin.com/in/fabientarrade/>
Twitter <https://twitter.com/fabtar>
Google <https://plus.google.com/+FabienTarradeProfile/posts>
Facebook <https://www.facebook.com/fabien.tarrade.eu>
Skype <skype:fabtarhiggs?call>
Xing <https://www.xing.com/profile/Fabien_Tarrade>