[R] PDF extraction with tm package

Jeff Newmiller jdnewmil at dcn.davis.ca.us
Fri Jul 22 17:20:06 CEST 2016


This is neither the Xpdf support forum nor the Windows Setup Program Reinvention support group... and you really need to read and follow the Posting Guide for the R mailing lists.

FWIW I would guess that you need to learn about environment variables and in particular about the PATH variable. There are OS-specific subtleties about when and how they get defined, certainly off topic here, that may trip you up along the way. Alternatively, you may read the Xpdf documentation or a how-to blog about Xpdf that gives you a recipe, but again that is not about R. Once you can start a CMD shell and run the command directly, you are most of the way to getting R to invoke it.
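
For what it is worth, the R side of that check can be sketched as below. This is a minimal sketch, not a fix: the directory is the install location mentioned in the original post and may differ on your machine, and add_to_path is a hypothetical helper name. Note that PATH entries are directories, not paths to individual .exe files, and that Sys.setenv only affects the current R session.

```r
# Assumed install location (taken from the original post); adjust as needed.
xpdf_dir <- "C:/Program Files/Xpdf/bin64"

# Hypothetical helper: prepend a directory to PATH for this R session only.
# PATH entries must be directories, not paths to individual executables.
add_to_path <- function(dir) {
  sep <- .Platform$path.sep
  entries <- strsplit(Sys.getenv("PATH"), sep, fixed = TRUE)[[1]]
  if (!dir %in% entries) {
    Sys.setenv(PATH = paste(c(dir, entries), collapse = sep))
  }
  invisible(Sys.getenv("PATH"))
}

# Only touch PATH if the assumed directory actually exists.
if (dir.exists(xpdf_dir)) add_to_path(xpdf_dir)

# Sys.which() returns "" for any program R cannot find on PATH.
found <- Sys.which(c("pdfinfo", "pdftotext"))
print(nzchar(found))
```

If both values print as TRUE, readPDF should be able to invoke the tools; if not, the problem is still in the OS setup rather than in R.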
-- 
Sent from my phone. Please excuse my brevity.

On July 21, 2016 5:26:26 PM PDT, Steven Kang <stochastickang at gmail.com> wrote:
>Hi R users,
>
>I’m having some issues trying to extract text from a PDF file using the tm
>package.
>
>Here are the steps that were carried out:
>
>1. Downloaded and installed the following programs:
>
>- Xpdf (copied the ‘bin32’, ‘bin64’, and ‘doc’ folders into the ‘C:\Program
>Files\Xpdf’ directory; also added C:\Program Files\Xpdf\bin64\pdfinfo.exe and
>C:\Program Files\Xpdf\bin64\pdftotext.exe to the existing PATH)
>
>- Tesseract
>
>- Imagemagick
>
>2. Ran the following script, which produced these error messages:
>
># Directory where PDF files are stored
>
>>cname <- getwd()
>
>>Corpus(DirSource(cname), readerControl=list(reader = readPDF))
>
>Error in system2("pdftotext", c(control$text, shQuote(x), "-"), stdout = TRUE) :
>  '"pdftotext"' not found
>
> In addition: Warning message:
>
>running command '"pdfinfo" "C:\Users\R_Files\XXX.pdf"' had status 127
>
>>file.exists(Sys.which(c("pdfinfo","pdftotext")))
>[1] FALSE FALSE
>
>It seems that R can’t find the pdfinfo and pdftotext exe files, but I’m not sure
>why this would be the case, given that the Xpdf files were copied into
>‘C:\Program Files’ (I’m using 64-bit Windows 7).
>
>I’m aware that the ‘pdf_text’ function from the pdftools package can extract text
>from a PDF file and return it as a string. But I am after something that can
>convert a PDF (i.e. transaction data) into a data frame without regular
>expressions. Is the tm package capable of doing this conversion? Are there any
>other alternatives to these methods?
>
>Your expertise in resolving this problem would be highly appreciated.
>
>
>Steve