[R] [External] Re: help with web scraping
Spencer Graves
@pencer@gr@ve@ @end|ng |rom e||ect|vede|en@e@org
Sat Jul 25 19:43:20 CEST 2020
Dear Rasmus Liland et al.:
On 2020-07-25 11:30, Rasmus Liland wrote:
> On 2020-07-25 09:56 -0500, Spencer Graves wrote:
>> Dear Rasmus et al.:
>
> It is LILAND et al., is it not? ... else it's customary to
> put a comma in there, isn't it? ...
The APA Style recommends "Sharp et al., 2007":
https://blog.apastyle.org/apastyle/2011/11/the-proper-use-of-et-al-in-apa-style.html
Regarding Confucius, I'm confused.
> right, moving on:
>
> On 2020-07-25 04:10, Rasmus Liland wrote:
>>
<snip>
>
> Please research using Thunderbird, Claws
> mail, or some other sane e-mail client;
> they are great, I promise.
Thanks. I researched it and turned off HTML. Please excuse: I noticed
it was a problem, but hadn't prioritized time to research and fix it
until your comment. Thanks.
>
>> Please excuse: Before my last post, I
>> had written code to do all that.
>
> Good!
>
>> In brief, the political offices are
>> "h3" tags.
>
> Yes, some type of header element at
> least, in-between the various tables,
> everything children of the div in the
> element tree.
>
>> I used "strsplit" to split the string
>> at "<h3>". I then wrote a
>> function to find "</h3>", extract the
>> political office and pass the rest to
>> "XML::readHTMLTable", adding columns
>> for party and political office.
>
> Yes, doing that for the political office
> is also possible, but the party is
> inside the table's caption tag, which
> ends up as the name of the table in the
> XML::readHTMLTable list ...
>
>> However, this suppressed "<br/>"
>> everywhere.
>
> Why is that, please explain.
>
I don't know why the Missouri Secretary of State's web site uses
"<br/>" to signal a new line, but it does. I also don't know why
XML::readHTMLTable dropped "<br/>" everywhere it occurred, but it did.
After I used gsub to replace "<br/>" with "\n", I found that
XML::readHTMLTable left "\n" intact, so I got what I wanted.
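
For anyone following along, here is a minimal sketch of that approach. The
HTML fragment, the names in it, and the helper "parse_office" are made up for
illustration; the real page's markup may differ. The XML step is guarded, so
the rest runs with base R alone.

```r
# Hypothetical fragment: offices in <h3>, party in <caption>,
# and <br/> used for line breaks inside cells.
html <- paste0(
  "<div>",
  "<h3>Governor</h3>",
  "<table><caption>Party A</caption>",
  "<tr><th>Name</th><th>Address</th></tr>",
  "<tr><td>Jane Doe</td><td>1 Main St<br/>Springfield</td></tr>",
  "</table>",
  "<h3>Treasurer</h3>",
  "<table><caption>Party B</caption>",
  "<tr><th>Name</th><th>Address</th></tr>",
  "<tr><td>John Roe</td><td>2 Oak Ave<br/>Columbia</td></tr>",
  "</table>",
  "</div>"
)

# Split at "<h3>"; drop the first piece, which precedes the first office.
pieces <- strsplit(html, "<h3>", fixed = TRUE)[[1]][-1]

# Hypothetical helper: separate the office name from the table markup
# and turn each <br/> into a newline before any table parsing happens.
parse_office <- function(piece) {
  end    <- regexpr("</h3>", piece, fixed = TRUE)
  office <- substr(piece, 1, end - 1)
  rest   <- substring(piece, end + nchar("</h3>"))
  rest   <- gsub("<br/>", "\n", rest, fixed = TRUE)
  list(office = office, html = rest)
}

offices <- lapply(pieces, parse_office)

# Optional: parse the tables if the XML package is installed.
if (requireNamespace("XML", quietly = TRUE)) {
  tables <- lapply(offices, function(o) {
    doc <- XML::htmlParse(o$html, asText = TRUE)
    tab <- XML::readHTMLTable(doc, stringsAsFactors = FALSE)[[1]]
    tab$office <- o$office  # add the political office as a column
    tab
  })
}
```

The substitution has to happen on the raw string, before the parser sees it,
because by the time the table is a data frame the "<br/>" is already gone.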
>> I thought there should be
>> an option with something like
>> "XML::readHTMLTable" that would not
>> delete "<br/>" everywhere, but I
>> couldn't find it.
>
> No, there is not, AFAIK. Please, if
> anyone else knows, please say so *echoes
> in the forest*
>
>> If you aren't aware of one, I can
>> gsub("<br/>", "\n", ...) on the string
>> for each political office before
>> passing it to "XML::readHTMLTable". I
>> just tested this: It works.
>
> Such a great hack! IMHO, this is much
> more flexible than using
> xml2::read_html, rvest::html_table,
> dplyr::mutate like here[1]
>
>> I have other web scraping problems in
>> my work plan for the next few days.
>
> Maybe, idk ...
>
>> I will definitely try
>> XML::htmlTreeParse, etc., as you
>> suggest.
>
> I wish you good luck,
> Rasmus
>
> [1] https://stackoverflow.com/questions/38707669/how-to-read-an-html-table-and-account-for-line-breaks-within-cells
And I added my solution to this problem to that Stack Overflow thread.
Thanks again,
Spencer
>
>
> ______________________________________________
> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>