[R] Removing words and initials with tm

Fri Apr 10 12:38:21 CEST 2015

Hi Sun,
Yes, removing the initials before converting to lower case should help:
matching is case sensitive by default in functions like "sub" and "gsub".
The problem may then become separating initials from acronyms, if any are
present in the corpus:

gsub("NM","","An NMR was performed on NM Jones")
[1] "An R was performed on  Jones"
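
One way around this is to anchor the pattern with word boundaries (\\b), so
that initials are removed only when they stand alone as a whole word. This is
just a minimal sketch in base R: the initials and the example sentences are
made up for illustration, and it assumes the text has not yet been converted
to lower case.

```r
# Hypothetical initials to strip; replace with the ones in your corpus.
initials <- c("NM", "AM", "EC", "AR")

# Build one alternation wrapped in word boundaries: \\b(NM|AM|EC|AR)\\b
pattern <- paste0("\\b(", paste(initials, collapse = "|"), ")\\b")

# Made-up sentences for illustration only.
x <- c("An NMR was performed on NM Jones",
       "AM said the arrival became late because EC and AR overslept")

# Matching is case sensitive, so "am" inside "became" is never a candidate,
# and the boundaries keep the "NM" inside "NMR" intact.
cleaned <- gsub(pattern, "", x, perl = TRUE)

# Tidy up the gaps left behind by the removed initials.
cleaned <- gsub("\\s+", " ", cleaned)
cleaned <- gsub("^ | $", "", cleaned)
cleaned
```

An acronym that is spelled exactly like someone's initials would still be
removed, so this only narrows the problem. In a tm pipeline the same gsub call
could presumably be applied with tm_map and content_transformer before the
lower-casing step.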

Names that are also place names, like York, may be tricky to deal with as
well:

gsub("York","","Reginald York took a holiday in New York.")
[1] "Reginald  took a holiday in New ."
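
If you know which phrases you want to protect, a negative lookbehind is one
possible workaround. The sketch below assumes that "New York" is the only
context worth keeping, which is a heuristic and not a general solution; it
needs perl = TRUE because lookbehind is not available in the default regex
engine.

```r
x <- "Reginald York took a holiday in New York."

# Remove "York" as a whole word unless it is directly preceded by "New ".
# (?<!New ) is a negative lookbehind; the protected prefix is an assumption
# made for this example.
out <- gsub("(?<!New )\\bYork\\b", "", x, perl = TRUE)
out
```

This leaves the city intact and drops only the surname. Each additional
protected phrase needs its own lookbehind, so for a long list of names it may
be simpler to remove full names such as "Reginald York" as fixed phrases.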

Jim

On Fri, Apr 10, 2015 at 8:19 PM, Sun Shine <phaedrusv at gmail.com> wrote:

> Hi list
>
> Using the tm package, part of the pre-processing work is to remove words,
> etc. from the corpus.
>
> I wish to remove people's names and also their initials, which are peppered
> throughout the corpus. But some people's initials are the same as parts of
> common words, so removing them mangles those words - e.g. 'am' = 'became'
> => 'bec e' or 'ec' = 'because' => 'b ause' or 'ar' = 'arrival' => 'rival'
> (which has a completely different meaning).
>
> Is there any way of doing this without leaving a trail of nonsense
> half-terms behind? I suspect that it might have something to do with
> regular expressions, but to be honest, I'm (currently) pretty crap with
> those.
>
> Would it make a difference if I removed initials and names *prior* to
> converting all text to lower case? Then, when I remove 'AM', 'became'
> should remain unaffected because it is lower case.
>
> Any recommendations on how best to proceed with this?
>
> Thanks as always.
> Sun