[R] [External] Re: help with web scraping
Rasmus Liland
jr@| @end|ng |rom po@teo@no
Sat Jul 25 11:10:51 CEST 2020
On 2020-07-24 10:28 -0500, Spencer Graves wrote:
> Dear Rasmus:
>
> > Dear Spencer,
> >
> > I unified the party tables after the
> > first summary table like this:
> >
> > url <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
> > M_sos <- RCurl::getURL(url)
> > saveRDS(object=M_sos, file="dcp.rds")
> > dat <- XML::readHTMLTable(M_sos)
> > idx <- 2:length(dat)
> > cn <- unique(unlist(lapply(dat[idx], colnames)))
>
> This is useful for this application.
>
> > dat <- do.call(rbind,
> > sapply(idx, function(i, dat, cn) {
> > x <- dat[[i]]
> > x[,cn[!(cn %in% colnames(x))]] <- NA
> > x <- x[,cn]
> > x$Party <- names(dat)[i]
> > return(list(x))
> > }, dat=dat, cn=cn))
> > dat[,"Date Filed"] <-
> > as.Date(x=dat[,"Date Filed"],
> > format="%m/%d/%Y")
>
> This misses something extremely
> important for this application: The
> political office. That's buried in
> the HTML or whatever it is. I'm using
> something like the following to find
> that:
>
> str(LtGov <- gregexpr('Lieutenant Governor', M_sos)[[1]])
Dear Spencer,
I came up with a solution, but it is not
very elegant. Instead of showing you
the solution and hoping you understand
everything in it, I instead want to give
you some emphatic hints to see if you
can come up with a solution on your own.
- XML::htmlTreeParse(M_sos)
- *Gandalf voice*: climb the tree
  until you find the content you are
  looking for flat out at the level of
  «The Children of the Div», *uuuUUU*
- you only want to keep the table and
  header tags at this level
- Use XML::xmlValue to extract the
  values of all the headers (the
  political positions)
- Observe that all the tables on the
  page you were able to extract
  previously using XML::readHTMLTable
  are at this level, shuffled between
  the political position header tags.
  This means you can extract the
  political position and party
  affiliation by using a for loop, if
  statements, typeof, names, and [] and
  [[]] to grab different things from
  the list (content or the bag itself).
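
To make the shape of the loop concrete without giving away the real page, here is a minimal sketch on a made-up miniature document. Everything about the markup is an assumption: the `div` id `main`, the `h3` header tag, the table `id`s holding the party name, and the candidate rows are all hypothetical, so the tag names will need adjusting after inspecting the actual tree. It also uses XML::htmlParse and XML::xmlChildren on internal nodes rather than XML::htmlTreeParse, which is a swap on my part.

```r
library(XML)

# Hypothetical miniature of the page: headers and tables are siblings
# under a single div (the ids, tags and rows below are invented).
html <- paste0(
  '<html><body><div id="main">',
  '<h3>Governor</h3>',
  '<table id="Republican"><tr><th>Name</th><th>Date Filed</th></tr>',
  '<tr><td>CANDIDATE A</td><td>2/25/2020</td></tr></table>',
  '<table id="Democratic"><tr><th>Name</th><th>Date Filed</th></tr>',
  '<tr><td>CANDIDATE B</td><td>2/26/2020</td></tr></table>',
  '<h3>Lieutenant Governor</h3>',
  '<table id="Libertarian"><tr><th>Name</th><th>Date Filed</th></tr>',
  '<tr><td>CANDIDATE C</td><td>3/2/2020</td></tr></table>',
  '</div></body></html>')

doc  <- htmlParse(html, asText = TRUE)
div  <- getNodeSet(doc, "//div[@id='main']")[[1]]
kids <- xmlChildren(div)   # «The Children of the Div»

office <- NA_character_    # most recent header seen so far
out <- list()
for (i in seq_along(kids)) {
  node <- kids[[i]]
  tag <- xmlName(node)
  if (tag == "h3") {
    # A header: remember the political position for the tables that follow
    office <- xmlValue(node)
  } else if (tag == "table") {
    # A table: read it and attach party (from the id) and current office
    x <- readHTMLTable(node, stringsAsFactors = FALSE)
    x$Party  <- xmlGetAttr(node, "id")
    x$Office <- office
    out[[length(out) + 1]] <- x
  }
  # anything else (text, comments) is skipped
}
dat <- do.call(rbind, out)
print(dat)
```

The sketch assumes every table has the same columns; on the real page the column padding with NA from the previous email would still be needed before rbind.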
XML::readHTMLTable strips away the
line break tags from the Mailing
address, so if you find a better way
of extracting the tables, tell me,
e.g. you get
8805 HUNTER AVEKANSAS CITY MO 64138
and not
8805 HUNTER AVE<br/>KANSAS CITY MO 64138
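
One possible workaround, offered only as a sketch: substitute the line break tags in the raw HTML with a visible separator before parsing, so the two address lines do not run together. The one-row table, the column names, and the choice of ", " as separator are all assumptions made for illustration; only the address itself is from the page.

```r
library(XML)

# Hypothetical one-row table containing a <br/> inside the address cell
html <- paste0(
  '<table id="Republican">',
  '<tr><th>Name</th><th>Mailing Address</th></tr>',
  '<tr><td>CANDIDATE A</td>',
  '<td>8805 HUNTER AVE<br/>KANSAS CITY MO 64138</td></tr>',
  '</table>')

sep <- ", "  # arbitrary separator, chosen for illustration
# Replace <br>, <br/> and <br /> (any case) before the HTML is parsed
fixed <- gsub("<br[[:space:]]*/?>", sep, html, ignore.case = TRUE)

doc <- htmlParse(fixed, asText = TRUE)
tab <- readHTMLTable(doc, stringsAsFactors = FALSE)[[1]]
print(tab[, "Mailing Address"])
```

On the real page the same gsub could be applied to M_sos before it is handed to XML::readHTMLTable.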
When you've completed this «programming
quest», you're back at the level of the
previous email, i.e. you have the
same tables, but with political position
and party affiliation added to them.
Best,
Rasmus