[R] Classifying large text corpora using R

andy1234 listanand at gmail.com
Fri Sep 2 20:23:28 CEST 2011


Dear everyone, 

I am new to R and want to do text classification on a very large
collection of documents (>500,000) distributed among 300 classes; this
collection is my training data. Could someone please tell me which R
packages to use and how well they scale, in both time and memory?

I do not know which packages are appropriate. I started by trying the tm
package (http://cran.r-project.org/package=tm) for pre-processing and the
FSelector package
(http://cran.r-project.org/web/packages/FSelector/index.html) for feature
selection, but both were far too slow to be usable for my task.

So my question is: what are the right packages to use for pre-processing,
feature selection, and classification? Please bear in mind that I may be
dealing with data that has millions of dimensions and may not even fit in
memory.

I posted on this issue twice
(http://r.789695.n4.nabble.com/Entropy-based-feature-selection-in-R-td3708056.html,
http://r.789695.n4.nabble.com/R-s-handling-of-high-dimensional-data-td3741758.html)
but did not get any response. This is a very critical piece of my research
and I have been struggling with this issue for a long time. Please consider
helping me out, directly or by pointing me to any other software/website
that you think may be more appropriate. 

Many thanks in advance.



