[R] package "tm" fails to remove "the" with remove stopwords
Ingo Feinerer
feinerer at logic.at
Sun Nov 15 17:05:36 CET 2009
On Thu, Nov 12, 2009 at 11:29:50AM -0500, Mark Kimpel wrote:
> I am using code that previously worked to remove stopwords using package "tm".
Thanks for reporting. This is a bug in the removeWords() function in
tm version 0.5-1 available from CRAN:
> require(tm)
> myDocument <- c("the rain in Spain", "falls mainly on the plain", "jack and jill ran up the hill", "to fetch a pail of water")
> text.corp <- Corpus(VectorSource(myDocument))
> #########################
> text.corp <- tm_map(text.corp, stripWhitespace)
> text.corp <- tm_map(text.corp, removeNumbers)
> text.corp <- tm_map(text.corp, removePunctuation)
> ## text.corp <- tm_map(text.corp, stemDocument)
> text.corp <- tm_map(text.corp, removeWords, c("the", stopwords("english")))
> dtm <- DocumentTermMatrix(text.corp)
> dtm
> dtm.mat <- as.matrix(dtm)
> dtm.mat
>
> > dtm.mat
>     Terms
> Docs falls fetch hill jack jill mainly pail plain rain ran spain the water
> 1        0     0    0    0    0      0    0     0    1   0     1   1     0
> 2        1     0    0    0    0      1    0     1    0   0     0   0     0
> 3        0     0    1    1    1      0    0     0    0   1     0   0     0
> 4        0     1    0    0    0      0    1     0    0   0     0   0     1
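The function removeWords() fails to remove patterns that occur at the
beginning or at the end of a line. In your example, "the" survives only in
the first document, "the rain in Spain", where it is the first word.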
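
To illustrate this class of bug, here is a minimal base-R sketch. It is
not the actual tm source: the "naive" pattern below is only an assumption
about how such a failure can arise, and the variable names are made up for
this example. A pattern that requires a space on both sides of a word
cannot match at the start or end of a string, whereas a word-boundary
pattern can.

```r
# Hypothetical illustration in base R; this is NOT the code used inside tm.
doc   <- "the rain in Spain"
words <- c("the", "in")
alternatives <- paste(words, collapse = "|")

# Assumed naive pattern: the word must have a space on both sides,
# so a word at the very start or end of the string is never matched.
naive_pattern <- paste0(" (", alternatives, ") ")
naive_result  <- gsub(naive_pattern, " ", doc)

# Word-boundary pattern: \b also matches at the start and end of the string.
bounded_pattern <- paste0("\\b(", alternatives, ")\\b")
bounded_result  <- gsub(bounded_pattern, "", doc, perl = TRUE)

print(naive_result)    # the leading "the" is still present
print(bounded_result)  # "the" and "in" are both gone; extra spaces remain
```

The second result still contains the whitespace left behind by the removed
words, which a later stripWhitespace step would collapse.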
This bug is fixed in the latest development version on R-Forge, and
the fix will be included in the next CRAN release.
Please see
https://r-forge.r-project.org/plugins/scmsvn/viewcvs.php/pkg/inst/NEWS?root=tm&view=markup
for a list of all bug fixes and changes in each tm version.
Best regards, Ingo Feinerer
--
Ingo Feinerer
Vienna University of Technology
http://www.dbai.tuwien.ac.at/staff/feinerer