[R] [External] Re: help with web scraping

Sat Jul 25 11:10:51 CEST 2020

On 2020-07-24 10:28 -0500, Spencer Graves wrote:
> Dear Rasmus:
> 
> > Dear Spencer,
> >
> > I unified the party tables after the
> > first summary table like this:
> >
> > 	url <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
> > 	M_sos <- RCurl::getURL(url)
> > 	saveRDS(object=M_sos, file="dcp.rds")
> > 	dat <- XML::readHTMLTable(M_sos)
> > 	idx <- 2:length(dat)
> > 	cn <- unique(unlist(lapply(dat[idx], colnames)))
> 
> This is useful for this application.
> 
> > 	dat <- do.call(rbind,
> > 	  lapply(idx, function(i, dat, cn) {
> > 	    x <- dat[[i]]
> > 	    # add any columns this table lacks, filled with NA
> > 	    x[,cn[!(cn %in% colnames(x))]] <- NA
> > 	    x <- x[,cn]
> > 	    x$Party <- names(dat)[i]
> > 	    x
> > 	  }, dat=dat, cn=cn))
> > 	dat[,"Date Filed"] <-
> > 	  as.Date(x=dat[,"Date Filed"],
> > 	          format="%m/%d/%Y")
> 
> This misses something extremely 
> important for this application:  The 
> political office.  That's buried in 
> the HTML or whatever it is.  I'm using 
> something like the following to find 
> that:
> 
> str(LtGov <- gregexpr('Lieutenant Governor', M_sos)[[1]])

Dear Spencer,

I came up with a solution, but it is not 
very elegant.  Instead of showing you 
the solution, hoping you understand 
everything in it, I instead want to give 
you some emphatic hints to see if you 
can come up with a solution on your own.

- XML::htmlTreeParse(M_sos)
  - *Gandalf voice*: climb the tree 
    until you find the content you are 
    looking for flat out at the level of 
    «The Children of the Div», *uuuUUU*
  - you only want to keep the table and 
    header tags at this level
- Use XML::xmlValue to extract the 
  values of all the headers (the 
  political positions)
- Observe that all the tables on the 
  page you were able to extract 
  previously using XML::readHTMLTable 
  are at this level, interleaved with 
  the political position header tags.  
  This means you can extract the 
  political position and party 
  affiliation by using a for loop, if 
  statements, typeof, names, and [] and 
  [[]] to grab different things from the 
  list (content or the bag itself).
- XML::readHTMLTable strips away the 
  line break tags from the Mailing 
  address, so if you find a better way 
  of extracting the tables, tell me, 
  e.g. you get

	8805 HUNTER AVEKANSAS CITY MO 64138

  and not 

	8805 HUNTER AVE<br/>KANSAS CITY MO 64138
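
One workaround that might do, as a 
minimal sketch: swap the <br/> tags for 
a visible separator in the raw HTML 
string before handing it to the parser.  
The one-row table below is made up, and 
the "; " separator is an arbitrary 
choice:

```r
library(XML)

# Made-up one-row table standing in for the real page (an assumption)
html <- paste0(
  '<table>',
  '<tr><th>Name</th><th>Mailing Address</th></tr>',
  '<tr><td>A. Example</td>',
  '<td>8805 HUNTER AVE<br/>KANSAS CITY MO 64138</td></tr>',
  '</table>')

# Replace <br>, <br/> and <br /> with a visible separator before parsing
html2 <- gsub("<br[[:space:]]*/?>", "; ", html, ignore.case = TRUE)

doc <- htmlParse(html2, asText = TRUE)
tab <- readHTMLTable(doc, header = TRUE, stringsAsFactors = FALSE)[[1]]
tab[["Mailing Address"]]
```

On the real page the same gsub call 
could presumably be applied to M_sos 
before XML::readHTMLTable, but I have 
not checked every table there.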

When you've completed this «programming 
quest», you're back at the level of the 
previous email, i.e. you have the same 
tables, but with political position and 
party affiliation added to them.
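
To make the hints concrete without 
spoiling the real page, here is a 
minimal sketch on a made-up fragment.  
The div id, the h3 header tag and the 
data-party attribute are my assumptions 
for illustration, not what the real 
page uses, and the sketch takes a 
shortcut via XPath (XML::getNodeSet) 
rather than climbing the 
XML::htmlTreeParse list by hand:

```r
library(XML)

# Made-up fragment, NOT the real page: the div id, the <h3> headers and
# the data-party attribute are assumptions for illustration only.
html <- paste0(
  '<html><body><div id="main">',
  '<h3>Governor</h3>',
  '<table data-party="Republican">',
  '<tr><th>Name</th><th>Date Filed</th></tr>',
  '<tr><td>A. Example</td><td>2/25/2020</td></tr></table>',
  '<table data-party="Democratic">',
  '<tr><th>Name</th><th>Date Filed</th></tr>',
  '<tr><td>B. Example</td><td>2/26/2020</td></tr></table>',
  '<h3>Lieutenant Governor</h3>',
  '<table data-party="Republican">',
  '<tr><th>Name</th><th>Date Filed</th></tr>',
  '<tr><td>C. Example</td><td>3/2/2020</td></tr></table>',
  '</div></body></html>')

doc <- htmlParse(html, asText = TRUE)

# Children of the div, keeping only header and table tags, in page order
kids <- getNodeSet(doc, "//div[@id='main']/*[self::h3 or self::table]")

position <- NA_character_
out <- list()
for (node in kids) {
  if (xmlName(node) == "h3") {
    # a header: remember it for the tables that follow
    position <- xmlValue(node)
  } else {
    # a table: read it and tag it with party and current position
    tab <- readHTMLTable(node, header = TRUE, stringsAsFactors = FALSE)
    tab$Party <- xmlGetAttr(node, "data-party")
    tab$Position <- position
    out[[length(out) + 1]] <- tab
  }
}
res <- do.call(rbind, out)
res
```

The only state carried through the loop 
is the most recent header, which is why 
the page order of headers and tables 
matters.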

Best,
Rasmus
