[R] Grab Element from Web Page

Sparks, John James jspark4 at uic.edu
Fri Aug 16 08:45:14 CEST 2013


Thanks, the second approach worked fine on Windows.

--JJS

On Thu, August 15, 2013 8:38 am, Jeffrey Dick wrote:
> Sorry, I can't generate an error when running those commands in R on Linux
> 64-bit. But if I move to Windows (R version 3.0.1, XML_3.98-1.1), I get a
> different error ...
>
>> require(XML)
> Loading required package: XML
>> doc <- htmlTreeParse("http://www.sec.gov/cgi-bin/browse-edgar?CIK=MSFT&Find=Search&owner=exclude&action=getcompany")
>> node <- getNodeSet(doc[[1]], "//link[@rel='alternate']" )
> Input is not proper UTF-8, indicate encoding !
> Bytes: 0xC2 0x0A 0x20 0x20
> Error: 1: Input is not proper UTF-8, indicate encoding !
> Bytes: 0xC2 0x0A 0x20 0x20
>> node <- getNodeSet(doc, "//link[@rel='alternate']" )
> Error in UseMethod("xpathApply") :
>   no applicable method for 'xpathApply' applied to an object of class
> "XMLDocumentContent"
>
> ... note that I've tried both doc[[1]] and doc in the function call. Also,
> only the XML library is required. I'm not sure what's going on with the
> character encoding error, might be my system settings. Reading the help
> page (?htmlTreeParse) provides a clue to use the htmlParse function
> instead, equivalent to setting the useInternalNodes parameter to TRUE ...
> "These can then be searched using XPath expressions via 'xpathApply' and
> 'getNodeSet'." That seems to be relevant to this case.
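
A minimal sketch of that difference, for anyone reading along. It assumes the XML package is installed and uses a small inline HTML snippet, made up here to mirror the link element on the SEC page, so it runs without network access. The variable names are only illustrative.

```r
library(XML)

# Made-up stand-in for the <head> of the SEC page (an assumption, not the real source).
html <- paste0(
  '<html><head>',
  '<link rel="alternate" type="application/atom+xml" title="ATOM" ',
  'href="/cgi-bin/browse-edgar?action=getcompany&amp;CIK=0000789019&amp;type=&amp;output=atom"/>',
  '</head><body></body></html>'
)

# Default htmlTreeParse() builds an R-level list, which the XPath functions cannot search.
tree_doc <- htmlTreeParse(html, asText = TRUE)

# htmlParse() keeps the document as internal (C-level) nodes instead.
internal_doc <- htmlParse(html, asText = TRUE)

# Same idea as htmlParse(), spelled out with the parameter named on the help page.
internal_doc2 <- htmlTreeParse(html, asText = TRUE, useInternalNodes = TRUE)

# Check which of the three objects is an internal document.
inherits(tree_doc, "XMLInternalDocument")
inherits(internal_doc, "XMLInternalDocument")
inherits(internal_doc2, "XMLInternalDocument")

# XPath works on the internal document; pull only the href attribute.
href <- xpathSApply(internal_doc, "//link[@rel='alternate']", xmlGetAttr, "href")
```

The first inherits() call should give FALSE and the other two TRUE, which matches the "no applicable method for 'xpathApply'" error seen with the plain htmlTreeParse() result.
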
>
>> doc <- htmlParse("http://www.sec.gov/cgi-bin/browse-edgar?CIK=MSFT&Find=Search&owner=exclude&action=getcompany")
>> node <- xpathSApply(doc, "//link[@rel='alternate']", xmlAttrs)
>> node
>       [,1]
> rel   "alternate"
> type  "application/atom+xml"
> title "ATOM"
> href  "/cgi-bin/browse-edgar?action=getcompany&CIK=0000789019&type=&dateb=&owner=exclude&count=40&output=atom"
>> strsplit(strsplit(node[[4]], "CIK=")[[1]][2], "&type")[[1]][1]
> [1] "0000789019"
>
> Perhaps that approach is less prone to error.
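
If the nested strsplit() calls feel fragile, a regular expression is one alternative. This is only a sketch in base R: extract_cik is a hypothetical helper name, it handles a single string at a time, and the href below is copied from the output quoted above.

```r
# Hypothetical helper: return the digits that follow "CIK=" in one URL string,
# or NA if the string contains no such run of digits.
extract_cik <- function(href) {
  # Lookbehind keeps "CIK=" out of the match, so only the digits are returned.
  m <- regmatches(href, regexpr("(?<=CIK=)[0-9]+", href, perl = TRUE))
  if (length(m) == 1) m else NA_character_
}

href <- "/cgi-bin/browse-edgar?action=getcompany&CIK=0000789019&type=&dateb=&owner=exclude&count=40&output=atom"
cik <- extract_cik(href)
cik
# [1] "0000789019"
```

The result stays a character string on purpose, since converting it to a number would drop the leading zeros.
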
>
>
> On Thu, Aug 15, 2013 at 12:48 PM, Sparks, John James <jspark4 at uic.edu> wrote:
>
>> Thanks so much for looking into this for me.
>>
>> Unfortunately, I get an error when I execute your code.  Is there a
>> library that you loaded that I haven't?
>>
>> require(scrapeR)
>> require(XML)
>> require(RCurl)
>> doc <- htmlTreeParse("http://www.sec.gov/cgi-bin/browse-edgar?CIK=MSFT&Find=Search&owner=exclude&action=getcompany")
>> node <- getNodeSet(doc[[1]], "//link[@rel='alternate']" )
>> Error in UseMethod("xpathApply") :
>>   no applicable method for 'xpathApply' applied to an object of class
>> "character"
>>
>>
>> Guidance would be much appreciated.
>>
>> --JJS
>>
>>
>>
>> On Wed, August 14, 2013 4:19 am, Jeffrey Dick wrote:
>> > Hi,
>> >
>> > There are many occurrences of the CIK number in the page source. This pulls
>> > out the first node containing it:
>> >
>> > node <- getNodeSet(doc[[1]], "//link[@rel='alternate']" )
>> >
>> > From there you can extract the number. Here's one way to do it.
>> >
>> > strsplit(strsplit(unlist(node)[[5]], "CIK=")[[1]][2], "&type")[[1]][1]
>> >
>> > Jeff
>> >
>> >
>> > On Wed, Aug 14, 2013 at 1:34 PM, Sparks, John James <jspark4 at uic.edu>
>> > wrote:
>> >
>> >> Dear R Helpers,
>> >>
>> >> I would like to pull the CIK number from the web page
>> >>
>> >> http://www.sec.gov/cgi-bin/browse-edgar?CIK=MSFT&Find=Search&owner=exclude&action=getcompany
>> >>
>> >> If you put this web page into your browser you will see the CIK number in
>> >> red on the left side of the page near the top.
>> >>
>> >> When I try the basic
>> >> require(scrapeR)
>> >> require(XML)
>> >> require(RCurl)
>> >> doc <- htmlTreeParse("http://www.sec.gov/cgi-bin/browse-edgar?CIK=MSFT&Find=Search&owner=exclude&action=getcompany")
>> >> str(doc)
>> >>
>> >> I get a large number of items in the data frame that I don't know how to
>> >> interpret.  Both
>> >> tables <- readHTMLTable(doc)
>> >>
>> >> and
>> >>
>> >> list<-xmlToList(doc)
>> >>
>> >> result in errors.
>> >>
>> >> Any (positive) guidance would be much appreciated.
>> >>
>> >> --John J. Sparks, Ph.D.
>> >>


