[R] PDF extraction with tm package

Fri Jul 22 02:26:26 CEST 2016

Hi R users,

I’m having some issues trying to extract text from PDF files using the tm
package.

Here are the steps that were carried out:

1. Downloaded and installed the following programs:

- Xpdf (copied the ‘bin32’, ‘bin64’ and ‘doc’ folders into the ‘C:\Program
Files\Xpdf’ directory; also added C:\Program Files\Xpdf\bin64\pdfinfo.exe and
C:\Program Files\Xpdf\bin64\pdftotext.exe to the existing PATH)

- Tesseract

- ImageMagick
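
PATH entries are normally directories rather than individual .exe files, so one thing worth trying is adding the bin64 folder itself. A minimal sketch for the current R session only, assuming the folder location given in the Xpdf step above; `xpdf_dir` and `path_entries` are illustrative names:

```r
# Assumption: the Xpdf binaries live in this folder (taken from the step above).
xpdf_dir <- "C:/Program Files/Xpdf/bin64"

# PATH holds directories, so append the folder, not the .exe files themselves.
# Windows separates PATH entries with ";".
old_path <- Sys.getenv("PATH")
Sys.setenv(PATH = paste(old_path, xpdf_dir, sep = ";"))

# The folder should now appear among the PATH entries of this session.
path_entries <- strsplit(Sys.getenv("PATH"), ";", fixed = TRUE)[[1]]
xpdf_dir %in% path_entries
```

This only affects the running R session; a permanent change would still have to be made in the Windows environment variable settings.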

2. Ran the following script, which produced these error messages:

# Directory where PDF files are stored

> cname <- getwd()

> Corpus(DirSource(cname), readerControl = list(reader = readPDF))

Error in system2("pdftotext", c(control$text, shQuote(x), "-"), stdout =
TRUE) :
'"pdftotext"' not found

 In addition: Warning message:

running command '"pdfinfo" "C:\Users\R_Files\XXX.pdf"' had status 127

> file.exists(Sys.which(c("pdfinfo", "pdftotext")))
[1] FALSE FALSE

It seems that R can’t find the pdfinfo and pdftotext executables, but I’m not
sure why this would be the case, given that the Xpdf files were copied into
‘C:\Program Files’ (I’m using 64-bit Windows 7).
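
A slightly more informative way to check which executables R can see, as a sketch: `Sys.which()` returns an empty string for anything it cannot find on the PATH, so `nzchar()` shows which tools are visible. The names `tools`, `found` and `status` are illustrative.

```r
# Executables that tm's readPDF calls in the error above.
tools <- c("pdfinfo", "pdftotext")

# Sys.which() returns the full path of each tool, or "" if it is not on PATH.
found <- Sys.which(tools)

# One row per tool: where it was found (if anywhere) and whether it is visible.
status <- data.frame(
  tool    = tools,
  path    = unname(found),
  on_path = unname(nzchar(found)),
  stringsAsFactors = FALSE
)
print(status)
```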

I’m aware that the ‘pdf_text’ function from the pdftools package can extract
text from a PDF file and return it as a string. But I am after something that
can convert a PDF (i.e. transaction data) into a data frame without regular
expressions. Is the tm package capable of this conversion? Are there any
alternatives to these methods?
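
To illustrate the kind of conversion meant here, a rough sketch using only base R on a made-up string that stands in for one page of `pdf_text()` output. It assumes the fields are separated by whitespace and that no field contains a space; the column names and values are invented for illustration, not real data.

```r
# Made-up stand-in for one page of extracted text. With a real file this might
# come from pdftools::pdf_text("XXX.pdf")[1] (assumes pdftools is installed).
page <- "date amount type\n2016-07-01 12.50 debit\n2016-07-02 80.00 credit\n"

# read.table() splits each line on whitespace, so no regular expression is
# needed, provided no field contains a space.
transactions <- read.table(text = page, header = TRUE, stringsAsFactors = FALSE)

str(transactions)
```

Real PDFs rarely lay out this cleanly, so something along these lines would only work where the extracted text keeps one record per line.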

Your expertise in resolving this problem would be highly appreciated.

Steve
