[R] Scraping a web page

Duncan Temple Lang duncan at wald.ucdavis.edu
Fri Dec 4 02:14:26 CET 2009


Hi Michael

If you just want all of the text that is displayed in the
HTML document, then you can use an XPath expression to select
all the text() nodes and extract their values.

An example is

  library(XML)
  doc = htmlParse("http://www.omegahat.org/")
  txt = xpathSApply(doc, "//body//text()", xmlValue)

The result is a character vector with one element per text node.

By limiting the nodes to the body, we avoid the content in <head>
such as inlined JavaScript or CSS.

The body may also contain <script> elements holding JavaScript
that you don't want. You can omit these with an extra predicate:

  txt = xpathSApply(doc, "//body//text()[not(ancestor::script)]", xmlValue)
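
To see what the predicate does without fetching anything, here is a
small sketch on an inline page. The HTML string and the variable names
are made up for illustration; only htmlParse(), xpathSApply() and
xmlValue() come from the XML package.

```r
library(XML)

# A made-up page: CSS in <head>, and a <script> inside <body>.
html <- paste0(
  "<html><head><title>T</title><style>p {color: red}</style></head>",
  "<body><p>Hello</p><script>var x = 1;</script><p>World</p></body></html>"
)

# asText = TRUE tells htmlParse() the string is the document itself,
# not a file name or URL.
doc <- htmlParse(html, asText = TRUE)

# Every text node under <body>, whatever its parent element is.
all_txt <- xpathSApply(doc, "//body//text()", xmlValue)

# Only the text nodes under <body> that have no <script> ancestor.
no_js <- xpathSApply(doc, "//body//text()[not(ancestor::script)]", xmlValue)

print(all_txt)
print(no_js)
```

Comparing the two printed vectors shows which nodes the
not(ancestor::script) test removes; the <title> and <style> content
never appears because the path starts at //body.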

And if there are other elements you want to ignore, you can add more
conditions (otherElement stands for the name of such an element):

  txt = xpathSApply(doc,
                    "//body//text()[not(ancestor::script) and not(ancestor::otherElement)]",
                    xmlValue)
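
Since the question was about a list of URLs, the same idea can be
wrapped in a function and applied to each one. This is only a sketch:
pageText, its skip argument and the urls vector are hypothetical names,
not part of the XML package, and dropping whitespace-only nodes is an
extra step that may or may not be what you want.

```r
library(XML)

# Hypothetical helper: return the displayed text of one page as a
# character vector, or NULL if the page cannot be read or parsed.
# Arguments in ... are passed on to htmlParse(), e.g. asText = TRUE.
pageText <- function(src, skip = c("script", "style"), ...) {
  xpath <- "//body//text()"
  if (length(skip) > 0) {
    # e.g. "not(ancestor::script) and not(ancestor::style)"
    pred <- paste(sprintf("not(ancestor::%s)", skip), collapse = " and ")
    xpath <- sprintf("%s[%s]", xpath, pred)
  }

  tryCatch({
    doc <- htmlParse(src, ...)
    # xpathApply() always returns a list, so unlist() is safe even
    # when no node matches.
    txt <- as.character(unlist(xpathApply(doc, xpath, xmlValue)))
    txt <- trimws(txt)   # strip leading/trailing whitespace
    txt[nzchar(txt)]     # drop nodes that were only whitespace
  }, error = function(e) NULL)
}

# Placeholder list of pages; substitute your own URLs.
urls <- c("http://www.omegahat.org/")
# texts <- lapply(urls, pageText)
```

The lapply() call is commented out so that nothing is downloaded until
you choose to run it. Returning NULL on failure keeps one unreachable
page from stopping the whole run.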


HTH,

 D.


Michael Conklin wrote:
> I would like to be able to submit a list of URLs of various webpages and extract the "content" i.e. not the mark-up of those pages. I can find plenty of examples in the XML library of extracting links from pages but I cannot seem to find a way to extract the text.  Any help would be greatly appreciated - I will not know the structure of the URLs I would submit in advance.  Any suggestions on where to look would be greatly appreciated.
> 
> Mike
> 
> W. Michael Conklin
> Chief Methodologist



