[R] Query regarding Approximate/Fuzzy matching & String Extraction(numeric) in R

Sun Sep 25 04:18:18 CEST 2016

> On Sep 24, 2016, at 11:49 AM, Aarushi Kaushal <kaushalaarushi at gmail.com> wrote:
> 
> Hey there,
> 
> I work for an organisation named Bullero Capital Pvt. Ltd. in New Delhi,
> which is involved in financial services, Portfolio management to be
> precise. Recently we've started creating ourselves a database using R for
> all the stocks etc. to be automated and hence analyzed accordingly for
> future investment purposes (data related to which is already available, and
> in our possession).
> 
> I and a colleague of mine, we are currently at the data cleaning stage -
> where we need to organize and format the data according to how we want it
> in the database. The problem lies in notation & symbols used in the
> original csv data files acquired from the government website - where we
> have to do approximate matching (for efficiency) and thereby extract the
> numerics only from that string of characters from the respective columns of
> the dataframe.
> 
> 1.) As of now we are looking at using the agrep function, to detect &
> locate the pattern matches namely - DIVIDEND , SPLIT, BONUS
> 
> 2.) From there on carry out the extraction of the respective numeric values
> associated with these actions in to the corresponding columns -
> BONUS_NUM(Numerator for the ratio), BONUS_DEN( Denominator for the ratio),
> SPLIT_NUM(Numerator for the ratio), SPLIT_DEN (Denominator for the Ratio),
> Final Dividend, Interim Dividend & Special Dividend.
> 
> 
> COLUMN PURPOSE
> 
>   1. DIVIDEND-RE.1/- PER SHARE
>   2. AGM/DIV-RS.3.50 PER SHARE
>   3. SPL DIV-RS.2.70 PER SHARE
>   4. DIV - FIN 3.50RE PER SHARE + SPL-Rs.1.4
>   5. FV SPLIT Rs.10 to RE.1
>   6. BON 3:2 + SPLT Rs. 5 to Rs.2.5
>   7. BONUS 4:1
>   8. DIV:10%
> 
> Ex.
> DIVIDEND-RE.1/- PER SHARE
> FINAL_DIV-1
> 
> AGM/DIV-RS.3.50 PER SHARE
> FINAL_DIV-3.50
> 
> SPL DIV-RS.2.70 PER SHARE
> SPECIAL DIV-2.70
> 
> Ex.
> FV SPLIT Rs.10 to RE.1
> SPLIT_NUM - 1
> SPLIT_DEN - 10
> 
> Ex. BONUS 4:1
> BONUS_NUM - 4
> BONUS_DEN - 1
> 
> However, the problem with that is that agrep returns the vector indices
> instead of the string indices which makes it cumbersome to extract the
> numeric values following the respective matches.

Please read ?agrep, which was my starting point. (I needed to check whether `agrep`, like `grep`, can return the character values of matches.)

Can you explain what that actually means? What would a "string index" be, if not the value returned when the `value` parameter of `agrep` is set to TRUE (value=TRUE)?
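
For reference, a minimal sketch of the two return modes of `agrep`. The vector name `purpose` and the choice of max.distance = 1 are assumptions made for illustration; the strings are copied from the listing above.

```r
# Sample strings copied from the PURPOSE listing in the original post
purpose <- c("DIVIDEND-RE.1/- PER SHARE",
             "AGM/DIV-RS.3.50 PER SHARE",
             "FV SPLIT Rs.10 to RE.1",
             "BON 3:2 + SPLT Rs. 5 to Rs.2.5",
             "BONUS 4:1")

# Default: positions within the vector 'purpose' of the elements that
# contain an approximate match (max.distance = 1 is an illustrative choice)
idx <- agrep("BONUS", purpose, max.distance = 1)

# value = TRUE: the matching elements themselves, not their positions
val <- agrep("BONUS", purpose, max.distance = 1, value = TRUE)

# The two results describe the same elements
print(idx)
print(val)
```

Either way the result refers to whole elements of the vector, not to a character position inside an element.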

> So I want a Fuzzy logic approach to
> 
>   - check for the presence of SPLIT, DIVIDEND, BONUS
>   - index of which ever cell the pattern match occurs in the column
>   PURPOSE of the data frame
>   - index position of that particular pattern in the string to extract the
>   numerical value following the matched pattern
> 
> *Basically Is there any way in R to determine if the patterns can be
> checked and matched approximately while returning for value - the indices
> for the same in the respective strings?**(such that if in case the symbols
> change furthermore in the future according to the government website's
> notation in the data storage, or the format/positioning/spacing changes -
> it could account for all those changes automatically.)*
> I am attaching below the .csv file consisting of just the column we need to
> carry out the cleaning in for your convenience.
> 
> It would be very helpful, if we could get some guidance as to how to
> proceed further at the earliest.
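
If I read the three bullet points correctly, they might be sketched as below. This is only a guess at what is wanted: the names `purpose`, `patterns`, `present` and `start`, and the tolerance max.distance = 1, are my assumptions. `agrepl` and `aregexec` are base R functions; `aregexec` reports where in each string the approximate match begins.

```r
# A few strings copied from the PURPOSE listing in the original post
purpose <- c("DIVIDEND-RE.1/- PER SHARE",
             "FV SPLIT Rs.10 to RE.1",
             "BON 3:2 + SPLT Rs. 5 to Rs.2.5",
             "BONUS 4:1")
patterns <- c("DIVIDEND", "SPLIT", "BONUS")

# 1. Presence: logical matrix, one row per string, one column per pattern
present <- sapply(patterns, function(p) agrepl(p, purpose, max.distance = 1))

# 2. Which cells (rows) of the column contain an approximate "SPLIT"
rows_split <- which(present[, "SPLIT"])

# 3. Where inside each string the approximate match starts
#    (aregexec gives -1 for strings with no match)
pos   <- aregexec("SPLIT", purpose, max.distance = 1)
start <- vapply(pos, function(x) as.integer(x[1]), integer(1))

print(present)
print(rows_split)
print(start)
```

Abbreviations such as BON or DIV are further from the full keyword than a single edit, so they would probably need a larger max.distance or an explicit regular expression.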

It would help us if _you_ constructed a simple example and explained what you want from it (as described in the Posting Guide).
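
Something along these lines would do. It is a sketch under my own assumptions: the object names `purpose`, `nums`, `bonus` and `wanted` are hypothetical, and the regular expression is one plausible way to pick out numbers, not necessarily the rule you need.

```r
# Small input, copied from the PURPOSE listing in the original post
purpose <- c("DIVIDEND-RE.1/- PER SHARE",
             "SPL DIV-RS.2.70 PER SHARE",
             "FV SPLIT Rs.10 to RE.1",
             "BONUS 4:1")

# dput() prints the object in a form that can be pasted into an email
dput(purpose)

# Every run of digits, optionally followed by a decimal part, per string
nums <- regmatches(purpose, gregexpr("[0-9]+(\\.[0-9]+)?", purpose))

# Desired output for the "BONUS 4:1" row, stated explicitly
bonus  <- as.numeric(nums[[4]])
wanted <- data.frame(PURPOSE   = purpose[4],
                     BONUS_NUM = bonus[1],
                     BONUS_DEN = bonus[2])
print(wanted)
```

Stating the desired result for each kind of row, as `wanted` does for one of them, removes the guesswork about which number belongs in which column.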

-- 

David Winsemius
Alameda, CA, USA