[R] Extracting xml data to data frames

John Kane jrkrideau at inbox.com
Thu Apr 16 23:17:28 CEST 2015

No attachment arrived: R-help is rather fussy about the files it will accept. You are probably okay with .txt, .pdf, or .png, but even .csv is likely to get stripped.

The best way to supply data is by using dput(). Type ?dput for information, or have a look at http://adv-r.had.co.nz/Reproducibility.html for some hints.
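
As a minimal sketch of what that looks like (the data frame below is made up for illustration and is not taken from the poster's file), dput() prints plain R code that anyone can paste into their own session to rebuild the object:

```r
# Hypothetical small data frame standing in for the real data
df <- data.frame(stop = c("A", "B", "C"),
                 minutes = c(0, 4, 9),
                 stringsAsFactors = FALSE)

# dput() writes out R code that recreates the object
dput(df)

# For a large object, posting the first few rows is usually enough
dput(head(df, 2))

# Round trip: the text produced by dput() can be evaluated to get the object back
txt <- paste(capture.output(dput(df)), collapse = "\n")
df2 <- eval(parse(text = txt))
```

Pasting the dput() output into the body of the message avoids the attachment problem altogether.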

John Kane
Kingston ON Canada

> -----Original Message-----
> From: g.rudge at bham.ac.uk
> Sent: Thu, 16 Apr 2015 17:57:44 +0000
> To: r-help at r-project.org
> Subject: [R] Extracting xml data to data frames
> Hi Rgonauts,
> I am trying to parse some xml files of transport data using the
> TransXChange format (in this case bus routing information) and obtain
> some data.frames for onward processing for a GIS related task.  Ideally I
> need them in .csv files.
> Each file (an example is attached) contains up to 8 tables of information
> about transport operators and routing information.  I have uploaded an
> example that contains all 8.  In fact I have some hundreds of similar
> files that will need processing. So when I've solved this I will need to
> be able to loop through a bunch of them.
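
The looping step could be sketched roughly as follows. This is only an illustration under stated assumptions: parse_one() is a hypothetical per-file parser, and the <Route>/<Stop> element names and the two stand-in files are invented so the example runs on its own. The real parser would replace the body of parse_one().

```r
library(XML)

# Hypothetical per-file parser: returns one data frame per file.
# The "//Stop" path is an invented placeholder, not the real schema.
parse_one <- function(path) {
  doc <- xmlParse(path)
  stops <- getNodeSet(doc, "//Stop")
  data.frame(file = basename(path),
             stop = sapply(stops, xmlValue),
             stringsAsFactors = FALSE)
}

# Two tiny stand-in files written to a temporary directory
in_dir <- file.path(tempdir(), "xml_in")
dir.create(in_dir, showWarnings = FALSE)
writeLines("<Route><Stop>A</Stop><Stop>B</Stop></Route>",
           file.path(in_dir, "r1.xml"))
writeLines("<Route><Stop>C</Stop></Route>",
           file.path(in_dir, "r2.xml"))

# Find every .xml file and write a matching .csv next to it
files <- list.files(in_dir, pattern = "\\.xml$", full.names = TRUE)
for (f in files) {
  out <- sub("\\.xml$", ".csv", f)
  write.csv(parse_one(f), out, row.names = FALSE)
}
```

Writing one CSV per input file keeps a failure in one file from losing the results of the others; the data frames could instead be combined with do.call(rbind, lapply(files, parse_one)) if a single table is wanted.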
> I'm new to handling xml data and to the XML package so I don't really
> know what I'm doing; this is my first stab at using the XML package.
> So far the workflow goes something like this.
> #get the file
> library(XML)
> doc=xmlTreeParse("cen_18-23-D-y11-2.xml")
> top=xmlRoot(doc)
> #look at the names
> names(top)
> #pick one of them to use, in this case the fourth one, 'routes', a table
> #of information about this particular bus route. Using some code from
> #another forum post, I can get a data.frame with the info I need in it.
> #OK, I need to do some reshaping but I can handle that later
> fr4<-(top[[4]])
> fr4
> xmlSApply(fr4,function(x) xmlSApply(x,xmlValue))
> df<-as.data.frame(xmlSApply(fr4,function(x) xmlSApply(x,xmlValue)))
> df
> #this works but when I try it with another table, the fifth one say, that
> #captures information about the parts of the journey between stops, it
> #falls over.
> fr5<-(top[[5]])
> fr5
> xmlSApply(fr5,function(x) xmlSApply(x,xmlValue))
> df<-as.data.frame(xmlSApply(fr5,function(x) xmlSApply(x,xmlValue)))
> df
> Now I guess there is an irregularity in the xml causing this.  I gather
> from other posts I should use XPath functionality to interrogate this
> section of the data. I've tried reverse engineering some of the
> commands I've seen in solutions to irregular xml problems on other forums
> but have not got to what I want. I'm not really up on xml, but I am
> assuming that the <JourneySectionPattern id=****> part of the file is
> what is causing the problem?  This looks like there should be
> a field called JourneyPattern ID (only I guess without the space) and then
> the ID code as the actual field contents.
> So my question is, is there a way to parse this table correctly and
> output the resulting df as a csv?
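
One possible approach, offered as a sketch rather than a tested solution for the real file: use XPath to pick out each record node, read the id attribute with xmlGetAttr(), and fill in NA wherever a child element is missing, so that every row has the same columns. The XML fragment below is invented to mimic the shape described above; the child element names (From, To, RunTime) and the helper child_value() are assumptions for illustration only.

```r
library(XML)

# Hypothetical fragment: the second record lacks a <RunTime> child,
# which is the kind of irregularity that breaks nested xmlSApply() calls
xml_text <- '
<Root>
  <JourneySections>
    <JourneySectionPattern id="JS1">
      <From>StopA</From>
      <To>StopB</To>
      <RunTime>PT2M</RunTime>
    </JourneySectionPattern>
    <JourneySectionPattern id="JS2">
      <From>StopB</From>
      <To>StopC</To>
    </JourneySectionPattern>
  </JourneySections>
</Root>'

# xmlParse() builds an internal tree, which the XPath functions require
doc <- xmlParse(xml_text, asText = TRUE)

# One node per record
nodes <- getNodeSet(doc, "//JourneySectionPattern")

# Helper (hypothetical): text of a named child, or NA when it is absent
child_value <- function(node, name) {
  val <- xpathSApply(node, paste0("./", name), xmlValue)
  if (length(val) == 0) NA_character_ else val[[1]]
}

# Build the table column by column; the attribute becomes its own column
df <- data.frame(
  id      = sapply(nodes, xmlGetAttr, "id"),
  from    = sapply(nodes, child_value, "From"),
  to      = sapply(nodes, child_value, "To"),
  runtime = sapply(nodes, child_value, "RunTime"),
  stringsAsFactors = FALSE
)

# Write the result out as a CSV file
out_file <- file.path(tempdir(), "journey_sections.csv")
write.csv(df, out_file, row.names = FALSE)
```

If the real files declare a default XML namespace, the XPath expressions above would match nothing as written and would need a namespace prefix mapping; that case is not covered by this sketch.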
> All help gratefully received.  BTW the link to the searchable r-help
> archives seems to be broken.
> GavinR
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
