[R] Scraping a web page
Duncan Temple Lang
duncan at wald.ucdavis.edu
Fri Dec 4 02:14:26 CET 2009
Hi Michael
If you just want all of the text that is displayed in the
HTML document, then you might use an XPath expression to get
all the text() nodes and get their values.
An example is
library(XML)   # provides htmlParse() and xpathSApply()
doc = htmlParse("http://www.omegahat.org/")
txt = xpathSApply(doc, "//body//text()", xmlValue)
The result is a character vector that contains all the text.
By limiting the nodes to the body, we avoid the content in <head>
such as inlined JavaScript or CSS.
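
As a small illustration of the point about <head>, the sketch below parses an inline HTML string instead of a URL, so it runs without network access. The page content, the variable names and the whitespace clean-up step are assumptions added for this example; they are not part of the original suggestion.

```r
library(XML)

# A made-up page, used only for illustration.
html = "<html><head><title>Demo page</title><style>p { color: red; }</style></head><body><h1>Heading</h1><p>Hello <b>world</b></p></body></html>"

doc = htmlParse(html, asText = TRUE)

# Only text nodes below <body>; the <title> and <style> content in <head> is skipped.
txt = xpathSApply(doc, "//body//text()", xmlValue)

# Optional clean-up: trim each piece and drop the ones that are empty.
txt = trimws(txt)
txt = txt[nzchar(txt)]

# Collapse to a single string if one value per page is more convenient.
page = paste(txt, collapse = " ")
```

Whether to keep the pieces separate or collapse them depends on what you do next; the vector form keeps a rough record of where one element's text ends and the next begins.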
A document may also have <script> elements within the body
containing JavaScript that you don't want.
You can omit these with
txt = xpathSApply(doc, "//body//text()[not(ancestor::script)]", xmlValue)
And if there were other elements you wanted to ignore, you could use
txt = xpathSApply(doc,
"//body//text()[not(ancestor::script) and not(ancestor::otherElement)]",
xmlValue)
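
Since the question was about a list of URLs, here is one way the pieces above might be wrapped up and applied to several pages. This is only a sketch: pageText is a hypothetical helper name (it is not part of the XML package), the choice to also skip <style> is an assumption about what you want to ignore, and the URLs are placeholders.

```r
library(XML)

# Hypothetical helper: return the visible text of one page as a
# character vector, or NULL if the page cannot be fetched or parsed.
# Extra arguments (e.g. asText = TRUE) are passed on to htmlParse().
pageText = function(src, ...) {
  tryCatch({
    doc = htmlParse(src, ...)
    txt = xpathSApply(doc,
            "//body//text()[not(ancestor::script) and not(ancestor::style)]",
            xmlValue)
    # xpathSApply() gives an empty list when nothing matches, so flatten first.
    txt = trimws(as.character(unlist(txt)))
    txt[nzchar(txt)]
  }, error = function(e) NULL)
}

# Placeholder URLs; substitute your own list.
urls = c("http://www.omegahat.org/", "http://www.r-project.org/")

# Left commented out so the sketch runs without network access:
# pages = lapply(urls, pageText)
# names(pages) = urls
```

Returning NULL on failure lets one bad URL pass without stopping the whole run; you can find the failures afterwards with sapply(pages, is.null). For https pages you may need to download the content yourself first and pass the resulting string with asText = TRUE.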
HTH,
D.
Michael Conklin wrote:
> I would like to be able to submit a list of URLs of various webpages and extract the "content" i.e. not the mark-up of those pages. I can find plenty of examples in the XML library of extracting links from pages but I cannot seem to find a way to extract the text. Any help would be greatly appreciated - I will not know the structure of the URLs I would submit in advance. Any suggestions on where to look would be greatly appreciated.
>
> Mike
>
> W. Michael Conklin
> Chief Methodologist
>
> MarketTools, Inc. | www.markettools.com<http://www.markettools.com>
> 6465 Wayzata Blvd | Suite 170 | St. Louis Park, MN 55426. PHONE: 952.417.4719 | CELL: 612.201.8978