[R] Question about XML package (accurately access one attribute in a multi-attribute node on a web page)
Humphrey Zhao
humphrey.zhao at yahoo.com
Tue Jun 16 15:01:55 CEST 2015
Dear Sir/Madam:
Thank you for your attention to my question. I have downloaded the source code of some web pages with RCurl, and I am trying to extract URLs from them. In these web pages, many nodes contain the same URL, such as the following:
<a href=\"http://cos.name/2015/05/the-data-wisdom-for-data-science/\" rel=\"bookmark\">
<a href=\"http://blog.shakirm.com/2015/03/a-statistical-view-of-deep-learning-ii-auto-encoders-and-free-energy/\" target=\"_blank\">
<a href=\"http://cos.name/2015/05/the-data-wisdom-for-data-science/#more-10947\" class=\"more-link\">
I want to select exactly the URL I need (the "href" in the first one). I have tried many approaches, and the closest I have come is the following:
library(XML)
# links <- getHTMLLinks(base.html, xpQuery = "//a/@href")
links <- getHTMLLinks(base.html, xpQuery = c("//a/href[@rel='bookmark']"))
However, I believe there is a correct way to do this that I have not been able to find. I wonder if you could give me some advice on solving this problem. I would be most grateful if you could reply at your earliest convenience. I look forward to hearing from you. Thank you very much.
Sincerely yours
Humphrey Zhao
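
A possible direction, offered only as a minimal sketch: in XPath, an attribute is addressed with "@", so "//a/href[...]" looks for a child element named href rather than the href attribute. Putting the predicate on the <a> element and then selecting "/@href" is one way to pick only the bookmark link. The sample page string below is hypothetical, built from the three anchors quoted in the question; htmlParse() and xpathSApply() are used here instead of getHTMLLinks() so that the input is parsed explicitly as text.

```r
library(XML)

# Hypothetical sample page assembled from the three anchors in the question.
page <- paste0(
  '<html><body>',
  '<a href="http://cos.name/2015/05/the-data-wisdom-for-data-science/" rel="bookmark">Post</a>',
  '<a href="http://blog.shakirm.com/2015/03/a-statistical-view-of-deep-learning-ii-auto-encoders-and-free-energy/" target="_blank">External</a>',
  '<a href="http://cos.name/2015/05/the-data-wisdom-for-data-science/#more-10947" class="more-link">More</a>',
  '</body></html>'
)

# Parse the string as HTML content rather than as a file name or URL.
doc <- htmlParse(page, asText = TRUE)

# The predicate [@rel='bookmark'] filters the <a> elements;
# /@href then selects the href attribute of the matching elements only.
links <- as.character(xpathSApply(doc, "//a[@rel='bookmark']/@href"))
print(links)
```

If the same post is linked from several bookmark anchors, wrapping the result in unique() would drop the repeats.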