[R] Classifying large text corpora using R
andy1234
listanand at gmail.com
Sun Sep 4 03:26:15 CEST 2011
Daniel Malter wrote:
>
> Take a look here: http://www.jstatsoft.org/v25/i05/paper
>
> HTH,
> Da.
>
>
> andy1234 wrote:
>>
>> Dear everyone,
>>
>> I am new to R, and I am looking at doing text classification on a huge
>> collection of documents (>500,000) which are distributed among 300
>> classes (so basically, this is my training data). Would someone please be
>> kind enough to let me know about the R packages to use and their
>> scalability (time and space)?
>>
>> I do not yet know which packages are right for this. I started off by
>> trying the tm package (http://cran.r-project.org/package=tm) for
>> pre-processing and the FSelector package
>> (http://cran.r-project.org/web/packages/FSelector/index.html) for
>> feature selection, but both are incredibly slow and completely
>> unusable for my task.
>>
>> So the question is what are the right packages to use (for
>> pre-processing, feature selection, and classification)? Please consider
>> the fact that I may be dealing with data of millions of dimensions which
>> may not even fit in memory.
>>
>> I posted on this issue twice
>> (http://r.789695.n4.nabble.com/Entropy-based-feature-selection-in-R-td3708056.html
>> ,
>> http://r.789695.n4.nabble.com/R-s-handling-of-high-dimensional-data-td3741758.html)
>> but did not get any response. This is a very critical piece of my
>> research and I have been struggling with this issue for a long time.
>> Please consider helping me out, directly or by pointing me to any other
>> software/website that you think may be more appropriate.
>>
>> Many thanks in advance.
>>
>
Hi,
Many thanks for your reply.
I did mention in my e-mail that I have already looked at the tm package. It
does not scale well at all.
There are also other stages in the pipeline (feature selection,
classification, etc.), and I need to find suitable R packages for those as
well.
Any other thoughts?
Thanks.
Andy
--
View this message in context: http://r.789695.n4.nabble.com/Classifying-large-text-corpora-using-R-tp3786787p3788667.html