[R] Extracting a chunk of text from a PDF file
Victor
vdemart at gmail.com
Sun Sep 18 20:16:19 CEST 2011
That's exactly how I work. Here is a chunk of my script.
In a nutshell, I am already extracting the web addresses from indweb (luckily an HTML file) by means of grep and gsub -- addresses
like http://www.terna.it/LinkClick.aspx?fileticket=TTQuOPUf%2fs0%3d&tabid=435&mid=3072 and the like. Unfortunately for me, these point to PDF files.
That's why I need to "translate" each PDF into a txt file.
Ciao
Vittorio
==============================
indweb<-"http://www.terna.it/default/Home/SISTEMA_ELETTRICO/dispacciamento/dati_esercizio/dati_giornalieri/confronto.aspx"
testo<-readLines(indweb)
k<-grep("^(.)+dnn_ctr3072_DocumentTerna_grdDocuments_(.)+CategoryCell\">(\\d\\d)/(\\d\\d)/201(\\d)",testo)
n<-length(k)
# Since the dates appear in descending order, sort the line indices so they are processed in ascending (chronological) order
k<-k[order(k,decreasing=TRUE)]
for (i in seq_along(k)) {
data<-gsub("^(.)+dnn_ctr3072_DocumentTerna_grdDocuments_(.)+CategoryCell\">","",testo[k[i]]) # i-th match, not always k[1]
data<-paste(substr(data,7,10), substr(data,4,5), substr(data,1,2), sep="-") # dd/mm/yyyy -> yyyy-mm-dd
mysel<-paste("select count(*) from richiesta where data=\"",data,"\";",sep="")
dataesiste<-as.integer(dbGetQuery(con,mysel))
if (dataesiste == 0) {
rif<-gsub("\">Confronto Giornaliero(.)+","",testo[k[i]]) # i-th match, not a hard-coded k[30]
rif<-gsub("^(.)+href=\"","",rif)
pag<-paste("http://www.terna.it",rif,sep="")
pagina<-readLines(pag)
# ... (rest of the script omitted) ...
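For the missing "translate the PDF into txt" step, one possibility (a sketch, assuming the pdftotext utility from the Xpdf/Poppler command-line tools is installed and on the PATH; the local file names are illustrative, not part of the script above) is to call the converter from R:

```r
# Sketch: download one of the extracted PDF links and convert it to text.
# Assumes pdftotext (Xpdf/Poppler) is installed; file names are hypothetical.
pdfurl <- "http://www.terna.it/LinkClick.aspx?fileticket=TTQuOPUf%2fs0%3d&tabid=435&mid=3072"
download.file(pdfurl, "confronto.pdf", mode = "wb")  # binary mode matters on Windows
system2("pdftotext", c("-layout", "confronto.pdf", "confronto.txt"))
righe <- readLines("confronto.txt")                  # plain text, one element per line
```

The -layout flag asks pdftotext to preserve the page's tabular layout, which tends to make the subsequent grep/regex step easier.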
On 18 Sep 2011, at 18:25, Joshua Wiley wrote:
> On Sun, Sep 18, 2011 at 7:44 AM, Victor <vdemart at gmail.com> wrote:
>> Unfortunately pdf2text doesn't seem to exist either on Linux or Mac OS X.
>
> I think Jeff's main point was to search for software specific to your
> task (converting a PDF to text). Formatting will be lost, so once you
> get your text files, I would use regular expressions to find the right
> part of the text to grab. Some general functions that might be
> relevant:
>
> ## for getting the text into R
> ?readLines
> ?scan
> ## for finding the part you need
> ?regex
> ?grep
>
> Cheers,
>
> Josh
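
[Editor's note: following Josh's advice, a minimal sketch of the regex step. It assumes the converted text file contains a line like "Totale: 1,025,823"; the file name "confronto.txt" and the exact label are assumptions about the PDF's content, not confirmed by the thread.]

```r
# Sketch: pull the "Totale" figure out of the converted text file.
righe <- readLines("confronto.txt")
riga  <- grep("Totale:", righe, value = TRUE)[1]        # first line containing the label
tot   <- sub(".*Totale:[[:space:]]*([0-9.,]+).*", "\\1", riga)  # capture the number
as.numeric(gsub(",", "", tot))                          # strip thousands separators
```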
>
>
>> Ciao Vittorio
>>
>> On 17 Sep 2011, at 21:00, Jeff Newmiller wrote:
>>
>>> Doesn't seem like an R task, but see pdf2text? (From pdftools, UNIX command line tools)
>>> ---------------------------------------------------------------------------
>>> Jeff Newmiller <jdnewmil at dcn.davis.ca.us>
>>> Research Engineer (Solar/Batteries/Software/Embedded Controllers)
>>> ---------------------------------------------------------------------------
>>> Sent from my phone. Please excuse my brevity.
>>>
>>> Victor <vdemart at gmail.com> wrote:
>>> In an R script I need to extract some figures from many web pages in PDF format. As an example, see http://www.terna.it/LinkClick.aspx?fileticket=TTQuOPUf%2fs0%3d&tabid=435&mid=3072, from which I would like to extract "Totale: 1,025,823".
>>> Is there any solution?
>>> Ciao
>>> Vittorio
>>>
>>>
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>
>>
>>
>>
>
>
>
> --
> Joshua Wiley
> Ph.D. Student, Health Psychology
> Programmer Analyst II, ATS Statistical Consulting Group
> University of California, Los Angeles
> https://joshuawiley.com/