[R] Removing words and initials with tm

Sun Shine phaedrusv at gmail.com
Fri Apr 10 13:37:48 CEST 2015

Thanks Jim

Can you say more about an R spell checker, or were you thinking of 
opening the parsed documents in a word processor, e.g. LibreOffice?

After stemming the documents, most of the words are mangled, e.g. 
'people' becomes 'peopl', so I think the spell checker would go crazy! I 
think a lot of this comes down to the order in which one runs the 
different transformations.
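
One possible ordering, sketched in base R rather than with tm itself: strip the upper-case initials as whole words while the case is still intact, then convert to lower case, then tidy up the whitespace, and leave stemming until last. The sample sentence and the `initials` vector are made up for illustration, and this is only a sketch of the idea, not a tested pipeline for a real corpus.

```r
# Hypothetical sample text and initials, for illustration only
txt <- "AM said the arrival became late because EC and AR left"
initials <- c("AM", "EC", "AR")

# 1. Remove initials as whole words while the text is still mixed case.
#    \\b marks a word boundary, so "AM" cannot match inside "became".
pattern <- paste0("\\b(", paste(initials, collapse = "|"), ")\\b")
step1 <- gsub(pattern, "", txt, perl = TRUE)

# 2. Only now convert to lower case.
step2 <- tolower(step1)

# 3. Collapse the gaps left behind and trim the ends.
cleaned <- trimws(gsub("\\s+", " ", step2))
print(cleaned)
# [1] "said the arrival became late because and left"
```

Stemming would then run on `cleaned`, so any spell check would have to happen before that final step.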


On 10/04/15 12:30, Jim Lemon wrote:
> Hi Sun,
> Good thinking. Looking at your reply, I realized that you may be able 
> to run a spell checker over the output to pick up mangled words.
> Jim
> On Fri, Apr 10, 2015 at 9:17 PM, Sun Shine <phaedrusv at gmail.com> wrote:
>     Hey Jim
>     So far I've re-run the process and sub'bed initials and proper
>     names with blank space, and changed other names (including
>     acronyms) to something less tricky (your e.g. #1 NMR is therefore
>     "NucMagRes", etc.) *before* I converted to lower case. By and
>     large, that seems to cut it, at least for my present purposes.
>     I don't have a workaround for your e.g. #2 though!
>     One really has to have a relatively decent handle on the scope of
>     the variations and text content first. I'm not sure how one would
>     do this kind of thing effectively on a large and unseen corpus.
>     Anyway, thanks for your reply and thoughts.
>     Sun
>     On 10/04/15 11:38, Jim Lemon wrote:
>>     Hi Sun,
>>     In fact, case sensitivity is the default in functions like "sub".
>>     The problem may then become separating initials from acronyms if
>>     they are present in the corpus:
>>     gsub("NM","","An NMR was performed on NM Jones")
>>     [1] "An R was performed on  Jones"
>>     How you are going to deal with names like York may also be tricky:
>>     gsub("York","","Reginald York took a holiday in New York.")
>>     [1] "Reginald  took a holiday in New ."
>>     Jim
>>     On Fri, Apr 10, 2015 at 8:19 PM, Sun Shine <phaedrusv at gmail.com> wrote:
>>         Hi list
>>         Using the tm package, part of the pre-processing work is to
>>         remove words, etc. from the corpus.
>>         I wish to remove people's names and also their initials, which
>>         are peppered throughout the corpus. But some people's initials
>>         are the same as parts of common words, so removing them mangles
>>         those words: e.g. removing 'am' turns 'became' into 'bec e',
>>         'ec' turns 'because' into 'b ause', and 'ar' turns 'arrival'
>>         into 'rival' (which has a completely different meaning).
>>         Is there any way of doing this without leaving a trail of
>>         nonsense half-terms behind? I suspect that it might have
>>         something to do with regular expressions, but to be honest,
>>         I'm (currently) pretty crap with those.
>>         Would it make a difference if I removed initials and names
>>         *prior* to converting all text to lower case? I would then
>>         remove 'AM' and, because 'became' is lower case, it should
>>         remain unaffected.
>>         Any recommendations on how best to proceed with this?
>>         Thanks as always.
>>         Sun
>>         ______________________________________________
>>         R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>         https://stat.ethz.ch/mailman/listinfo/r-help
>>         PLEASE do read the posting guide
>>         http://www.R-project.org/posting-guide.html
>>         and provide commented, minimal, self-contained, reproducible
>>         code.
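
For reference, the two `gsub` examples quoted above can be revisited with word boundaries. This is a hedged sketch in base R: `\\b` deals with the NMR case, but both occurrences of "York" are whole words, so the second call falls back on a negative lookbehind that skips any "York" preceded by "New ". That lookbehind is an assumption about this one sentence, not a general way of telling surnames from place names.

```r
# Jim's first example, with word boundaries added:
# "NM" inside "NMR" has no boundary after the M, so it is left alone.
nmr <- gsub("\\bNM\\b", "", "An NMR was performed on NM Jones", perl = TRUE)
print(nmr)
# [1] "An NMR was performed on  Jones"

# Word boundaries alone do not separate the two uses of "York".
# A negative lookbehind (PCRE syntax, hence perl = TRUE) skips any
# "York" that directly follows "New ".
york <- gsub("(?<!New )\\bYork\\b", "",
             "Reginald York took a holiday in New York.", perl = TRUE)
print(york)
# [1] "Reginald  took a holiday in New York."
```

Each known place name would need its own exception of this kind, which is why some knowledge of the corpus still seems unavoidable.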
