[R] Newbie - Scrape Data From PDFs?

Jeff Newmiller jdnewmil at dcn.davis.ca.us
Wed Jan 24 09:39:23 CET 2018


And a warning to the OP: PDF files are like packages. A wide variety of things can be inside, including text in semi-random order or bitmap images of text, so a tool that extracts text from the file will only be of use if your PDF files happen to be of the type that contains reasonably unscrambled text.
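
One rough way to check which kind of PDF you have is to look at how much text comes out of each page. The sketch below is only a heuristic under stated assumptions: has_text_layer() is a hypothetical helper, the 20-character threshold is arbitrary, and the input is assumed to be a character vector with one element per page, which is what pdf_text() from the pdftools package returns. It can flag pages that are probably scanned images, but it cannot tell you whether the extracted text is in a sensible order, so you still need to read a few pages by eye.

```r
# Heuristic: TRUE for pages whose extracted text is long enough to suggest a
# real text layer, FALSE for pages that are probably bitmap images of text.
# `pages` is a character vector with one element per page.
# `min_chars` is an arbitrary cut-off (an assumption, not a recommended value).
has_text_layer <- function(pages, min_chars = 20) {
  visible <- gsub("[[:space:]]+", "", pages)  # ignore whitespace when counting
  nchar(visible) >= min_chars
}

# Stand-in for extracted text: page 1 has a text layer, page 2 is a scan.
pages <- c("Station  Date        Vehicles\nA12      2018-01-02  1043\n", "  \n")
has_text_layer(pages)  # TRUE for page 1, FALSE for page 2

# With a real file (hypothetical name), assuming pdftools is installed:
# pages <- pdftools::pdf_text("traffic-counts.pdf")
# which(!has_text_layer(pages))  # pages that probably need OCR instead
```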
-- 
Sent from my phone. Please excuse my brevity.

On January 23, 2018 11:35:38 PM PST, Ulrik Stervbo <ulrik.stervbo at gmail.com> wrote:
>I think I would use pdftk to extract the form data. All subsequent
>manipulation in R.
>
>HTH
>Ulrik
>
>Eric Berger <ericjberger at gmail.com> wrote on Wed, Jan 24, 2018, 08:11:
>
>> Hi Scott,
>> I have never done this myself but I read something recently on the
>> r-help distribution that was related.
>> I just did a quick search and found a few hits that might work for you.
>>
>> 1. https://medium.com/@CharlesBordet/how-to-extract-and-clean-data-from-pdf-files-in-r-da11964e252e
>> 2. http://bxhorn.com/2016/extract-data-tables-from-pdf-files-in-r/
>> 3. https://www.rdocumentation.org/packages/textreadr/versions/0.7.0/topics/read_pdf
>>
>> HTH,
>> Eric
>>
>> On Wed, Jan 24, 2018 at 3:58 AM, Scott Clausen <scottclausen at mac.com>
>> wrote:
>> > Hello,
>> >
>> > I’m new to R and am using it with RStudio to learn the language. I’m
>> > doing so as I have quite a lot of traffic data I would like to
>> > explore. My problem is that all the data is located on a number of
>> > PDFs. Can someone point me to info on gathering data from other
>> > sources? I’ve been to the R FAQ and didn’t see anything and would
>> > appreciate your thoughts.
>> >
>> > I am quite sure now that often, very often, in matters concerning
>> > religion and politics a man's reasoning powers are not above the
>> > monkey's.
>> >
>> > -- Mark Twain
>> >
>> > ______________________________________________
>> > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> > https://stat.ethz.ch/mailman/listinfo/r-help
>> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> > and provide commented, minimal, self-contained, reproducible code.


