[R] tm: Read a single text file into a corpus as single document?

Juan Carlos Borrás jcborras at gmail.com
Tue Jul 19 11:48:41 CEST 2011


Some hints:
list.files() will return the list of files in a directory
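readLines() will allow you to load a text file as a character vector with one element per line; paste(x, collapse = "\n") then joins those lines into a single string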
strsplit() will allow you to break lines into words
c(x, y) concatenates vectors x and y; x <- c(x, y) appends vector y to x
unique() will allow you to get rid of repeats
And the Map/Reduce family of functions will allow you to write what
you want in about 15 lines of concise R code with no loops.
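
A minimal sketch of how these hints might fit together, assuming plain-text files and whitespace-separated words. The helper names read_as_one_string() and vocabulary() and the demo files are hypothetical, made up for illustration. The last step assumes a tm version that provides VCorpus(), VectorSource() and c() for corpora, and is skipped if tm is not installed.

```r
# Hypothetical helper: read one text file into a single string,
# so that the whole file can be treated as one document.
read_as_one_string <- function(path) {
  paste(readLines(path, warn = FALSE), collapse = "\n")
}

# Hypothetical helper: unique words across several files, no explicit loops.
vocabulary <- function(paths) {
  docs  <- Map(read_as_one_string, paths)
  words <- Map(function(d) strsplit(d, "[[:space:]]+")[[1]], docs)
  unique(Reduce(c, words, character(0)))
}

# Demo on two throwaway files in a temporary directory.
demo_dir <- file.path(tempdir(), "corpus_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("first line", "second line"), file.path(demo_dir, "a.txt"))
writeLines("third line", file.path(demo_dir, "b.txt"))

paths <- list.files(demo_dir, pattern = "\\.txt$", full.names = TRUE)
docs  <- unlist(Map(read_as_one_string, paths), use.names = FALSE)
vocab <- vocabulary(paths)

# Optional: turn each string into a one-document corpus and combine them.
# Assumes tm is installed; otherwise this block does nothing.
if (requireNamespace("tm", quietly = TRUE)) {
  corp <- tm::VCorpus(tm::VectorSource(docs[1]))           # one file, one document
  corp <- c(corp, tm::VCorpus(tm::VectorSource(docs[2])))  # append a second document
  print(length(corp))
}
```

Because each file is collapsed to one string before it reaches VectorSource(), the full path can be used directly and there is no need to split off the enclosing directory.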

Hope it helps,
Cheers,
jcb!

On Tue, Jul 19, 2011 at 11:11 AM, Alexander James Rickett
<ack.vandal at gmail.com> wrote:
> Hello everyone,
>
> I'm doing some JGR (a GUI frontend for R) development, specifically adding functionality from tm. In order to enable users to select some text files from a file dialog and turn them into a corpus, I need to be able to generate a corpus using a *SINGLE* text file as a single document, and to append a new document to an existing corpus. I know that if I could read each file into a single character vector I'd be in business, but I can't find how to do this either. This seems like a no-brainer, so I'm at my wits' end.
>
> Here's pseudocode for what I'd like to be able to do:
>
> ##########################################
>> corp1doc <- Corpus(singleTextDocSource("path/to/doc")) #read in 1 text doc as a 1-document corpus
>> corp1doc
>        A corpus with 1 text document
>
>> corp1doc[[2]] <- AnotherSingleTextDoc("path/to/doc") #append a second document to the same corpus
>> corp1doc
>        A corpus with 2 text documents
> ##########################################
>
> I can almost do this with DirSource, by setting pattern='filename', but this also requires me to separate the file name from the path to its enclosing directory, which shouldn't be necessary.
>
> Thanks for taking a look!


