[R] Removing words and initials with tm

Sun Shine phaedrusv at gmail.com
Fri Apr 10 15:42:53 CEST 2015


Thanks Jeff.

I'll add that to the ever-growing list my current studies are generating 
daily. :-)

Cheers
S


On 10/04/15 14:32, Jeff Newmiller wrote:
> "I suspect that it might have something to do with regular expressions, but to be honest, I'm (currently) pretty crap with those."
>
> I cannot think of a better incentive to take action on this hole in your education and buckle down to learn regular expressions. There are many books and tutorials available.
> ---------------------------------------------------------------------------
> Jeff Newmiller                        The     .....       .....  Go Live...
> DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
>                                        Live:   OO#.. Dead: OO#..  Playing
> Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
> /Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
> ---------------------------------------------------------------------------
> Sent from my phone. Please excuse my brevity.
>
> On April 10, 2015 3:19:51 AM PDT, Sun Shine <phaedrusv at gmail.com> wrote:
>> Hi list
>>
>> Using the tm package, part of the pre-processing work is to remove
>> words, etc. from the corpus.
>>
>> I wish to remove people's names and also their initials, which are
>> peppered throughout the corpus. But some people's initials are the
>> same as parts of common words, so removing them mangles those words:
>> e.g. 'am': 'became' => 'bec e', 'ec': 'because' => 'b ause', or
>> 'ar': 'arrival' => 'rival' (which has a completely different meaning).
>>
>> Is there any way of doing this without leaving a trail of nonsense
>> half-terms behind? I suspect that it might have something to do with
>> regular expressions, but to be honest, I'm (currently) pretty crap with
>> those.
>>
>> Would it make a difference if I removed initials and names *prior* to
>> converting all text to lower case? That is, if I remove 'AM' while
>> 'became' is already lower case, should 'became' remain unaffected?
>>
>> Any recommendations on how best to proceed with this?
>>
>> Thanks as always.
>> Sun
>
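
For readers of the archive, here is a minimal sketch of the regular-expression
idea discussed above, using base R only. The sample sentence and the list of
initials are made up for illustration; nothing here comes from the original
poster's corpus. The sketch assumes the initials appear in upper case in the
raw text, so the whole-word, case-sensitive removal has to run before any
lower-casing step.

```r
# Hypothetical sample text and initials, for illustration only.
txt <- "AM said it became late because AR saw the arrival of EC."
initials <- c("AM", "EC", "AR")

# Naive approach: lower-case first, then delete plain substrings.
# This also cuts into ordinary words such as 'became' and 'arrival'.
naive <- tolower(txt)
for (s in tolower(initials)) {
  naive <- gsub(s, "", naive, fixed = TRUE)
}

# Whole-word approach: \\b matches a word boundary, so only stand-alone
# tokens are removed. The match is case-sensitive, so run it BEFORE tolower().
pattern <- paste0("\\b(", paste(initials, collapse = "|"), ")\\b")
clean <- gsub(pattern, "", txt, perl = TRUE)

# Tidy up the whitespace left behind by the removed tokens.
clean <- trimws(gsub("\\s+", " ", clean))

print(naive)
print(clean)
```

With a tm corpus, a function like this could probably be wrapped with
content_transformer() and applied through tm_map() ahead of the lower-casing
step, but that part is untested here. Note that once the text is lower case,
word boundaries alone cannot distinguish the initials 'am' from the ordinary
word 'am', which is one reason to remove initials first.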