[R] Extracting complete information from XML data file using R-Nested Lists
sowmiyan
sowmiyan0508 at gmail.com
Sun Jan 24 18:27:01 CET 2016
I am working with an XML file, which can be found at the link Sample XML file
<https://www.dropbox.com/s/8kn9g8xev2u5n8o/Dummy.xml?dl=0&preview=Dummy.xml>
I am trying to extract the information in every field to a CSV file. My
required output is shown below:
*Total of 20 columns and 2 rows*
DateCreated DateModified Creator.UserAccountName Creator.PersonName
Creator..attrs.referenceNumber Modifier.UserAccountName Modifier.PersonName
Modifier..attrs.referenceNumber AdditionalEmailStr AdditionalComment
DateIssued DocumentaryInstructions NominationParcel.attr.Referencenumber
NominationParcel.SecondContractNumber
NominationParcel.Coordinator.RefernceNumber
NominationParcel.Coordinator.Username NominationParcel.Coordinator.Email
NominationParcel.Coordinator.Office.Name
NominationParcel.Coordinator.Office.Email
NominationParcel.Coordinator.Office.attrs.referenceNumber
Nomination 2007-11-25T17:01:32 2007-11-25T17:11:09 mkolker Merryn Kolker
15351 mkolker Merryn Kolker 15351 Good work 7 sam
Nomination 2007-11-25T17:18:01 2007-11-25T17:19:11 mkolker Merryn Kolker
15351 mkolker Merryn Kolker 15351 Nicely Performed 10 107 102
However, I have not been able to get the output in the required format. I
have tried two different approaches.
1. Below is my first attempt. The problem with it is that NULL fields are
not captured correctly, so data spills over into the wrong columns. I am
also not able to capture all the fields of the nested lists in the XML.
*Code 1*
library(XML)  # provides xmlParse() and xmlToList()
doc <- xmlParse("Dummy.xml")
lst <- xmlToList(doc)
# Note: the body indexes with the global `cols`, not the argument `col`
f <- function(col) do.call(rbind, lapply(lst, function(x) unlist(x[cols])))
cols <- c("DateCreated", "DateModified", "Creator", "Modifier",
          "AdditionalEmailStr", "AdditionalComment", "DateIssued",
          "DocumentaryInstructions", "NominationParcel")
res <- setNames(lapply(cols, f), cols)
list2env(res, .GlobalEnv)
*Output 1*
DateCreated DateModified Creator.UserAccountName Creator.PersonName
Creator..attrs.referenceNumber Modifier.UserAccountName Modifier.PersonName
Modifier..attrs.referenceNumber AdditionalComment
NominationParcel.Coordinator.UserAccountName
NominationParcel.Coordinator.Office..attrs.referenceNumber
NominationParcel.Coordinator..attrs.referenceNumber
NominationParcel..attrs.referenceNumber
Nomination 2007-11-25T17:01:32 2007-11-25T17:11:09 mkolker Merryn Kolker
15351 mkolker Merryn Kolker 15351 Good Work sam 7
Nomination 2007-11-25T17:18:01 2007-11-25T17:19:11 mkolker Merryn Kolker
15351 mkolker Merryn Kolker 15351 Nicely performed 102 107 10
2007-11-25T17:18:01
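The spillover in Output 1 is consistent with two base R behaviours: unlist() drops NULL entries, and rbind() recycles a shorter vector to fill the row. The following is only a minimal sketch of that effect. It uses small hand-built lists as a stand-in for the xmlToList() result, and the e-mail value is an illustrative assumption rather than data from Dummy.xml.

```r
# Two hypothetical records shaped like xmlToList() output. The second one
# has a NULL field, as an empty XML element would produce.
rec1 <- list(DateCreated        = "2007-11-25T17:01:32",
             AdditionalEmailStr = "a@example.com",   # illustrative value
             AdditionalComment  = "Good work")
rec2 <- list(DateCreated        = "2007-11-25T17:18:01",
             AdditionalEmailStr = NULL,
             AdditionalComment  = "Nicely Performed")

# unlist() silently drops NULL entries, so the two vectors differ in length.
v1 <- unlist(rec1)   # 3 elements
v2 <- unlist(rec2)   # 2 elements

# rbind() takes the column names from the first vector and recycles the
# shorter one (normally with a warning), so the comment lands under
# AdditionalEmailStr and DateCreated is repeated in the last column.
m <- suppressWarnings(rbind(v1, v2))
print(m)
```

Here the last cell of the second row repeats that row's DateCreated, which resembles the stray 2007-11-25T17:18:01 at the end of Output 1.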
2. To avoid values spilling from one cell into another because of NULL
fields, I used for loops to replace the NULL cells with NA. With this I was
able to capture the correct data, but I could not get all of the fields
present in the XML.
*Code 2*
library(XML)  # provides xmlParse() and xmlToList()
doc <- xmlParse("Dummy.xml")
lstsub <- xmlToList(doc)
# Replace NULL entries with NA, walking up to three levels of nesting
for (i in 1:length(lstsub)) {
  for (j in 1:length(lstsub[[i]])) {
    lstsub[[i]][[j]] <-
      ifelse(is.null(lstsub[[i]][[j]]), NA, lstsub[[i]][[j]])
    if (length(lstsub[[i]][[j]]) > 1) {
      for (k in 1:length(lstsub[[i]][[j]])) {
        lstsub[[i]][[j]][[k]] <-
          ifelse(is.null(lstsub[[i]][[j]][[k]]), NA, lstsub[[i]][[j]][[k]])
        if (length(lstsub[[i]][[j]][[k]]) > 1) {
          for (l in 1:length(lstsub[[i]][[j]][[k]])) {
            lstsub[[i]][[j]][[k]][[l]] <-
              ifelse(is.null(lstsub[[i]][[j]][[k]][[l]]), NA,
                     lstsub[[i]][[j]][[k]][[l]])
          }
        }
      }
    }
  }
}
# Note: the body indexes with the global `cols`, not the argument `col`
f <- function(col) do.call(rbind, lapply(lstsub, function(x) unlist(x[cols])))
cols <- c("DateCreated", "DateModified", "Creator", "Modifier",
          "AdditionalEmailStr", "AdditionalComment", "DateIssued",
          "DocumentaryInstructions", "NominationParcel")
res <- setNames(lapply(cols, f), cols)
list2env(res, .GlobalEnv)
write.csv(Creator, "dummy_2.csv")
*Output 2*
           DateCreated         DateModified        Creator Modifier AdditionalEmailStr AdditionalComment DateIssued DocumentaryInstructions
Nomination 2007-11-25T17:01:32 2007-11-25T17:11:09 mkolker mkolker  NA                 Good Work         NA         NA
Nomination 2007-11-25T17:18:01 2007-11-25T17:19:11 mkolker mkolker  NA                 Nicely performed  NA         NA
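One possible reason Output 2 keeps only the first value of each nested field: ifelse() returns a result the same length as its test, and is.null() always yields a single logical, so a multi-element list is cut down to its first element. The sketch below shows that effect on a hand-built list and contrasts it with a recursive helper. The name null_to_na is hypothetical, and this is only one way to do it under those assumptions.

```r
# Hypothetical nested field shaped like xmlToList() output.
creator <- list(UserAccountName = "mkolker",
                PersonName      = "Merryn Kolker",
                .attrs          = c(referenceNumber = "15351"))

# ifelse() sizes its result by the test, which has length 1 here,
# so only the first element of `creator` survives.
truncated <- ifelse(is.null(creator), NA, creator)
length(truncated)

# Hypothetical helper: replace NULL with NA at any depth and leave
# every other value untouched, with no fixed number of nested loops.
null_to_na <- function(x) {
  if (is.null(x)) return(NA_character_)
  if (is.list(x)) return(lapply(x, null_to_na))
  x
}

record  <- list(Creator = creator, AdditionalEmailStr = NULL)
cleaned <- null_to_na(record)
names(unlist(cleaned))
```

Because the helper recurses, it does not need to know in advance how deeply the XML is nested.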
Could somebody please help me get the required output?
I have posted the same question on Stack Overflow; the link is below (it
might help give a clearer picture):
http://stackoverflow.com/questions/34963724/extracting-complete-information-from-nested-lists-in-xml-to-a-data-frame-using-r/34963821#34963821
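For reference, the shape of the required output (one row per record, one column per nested field, NA where a field is missing) can be sketched by flattening each record with unlist() and then aligning the rows on the union of their names rather than by position. This is only a sketch under assumptions: the records below are hand-built stand-ins for the xmlToList() result, and the NominationParcel value is illustrative, not taken from Dummy.xml.

```r
# Hypothetical records shaped like xmlToList() output.
records <- list(
  Nomination = list(
    DateCreated        = "2007-11-25T17:01:32",
    Creator            = list(UserAccountName = "mkolker",
                              PersonName      = "Merryn Kolker",
                              .attrs = c(referenceNumber = "15351")),
    AdditionalEmailStr = NULL,
    AdditionalComment  = "Good work"),
  Nomination = list(
    DateCreated        = "2007-11-25T17:18:01",
    Creator            = list(UserAccountName = "mkolker",
                              PersonName      = "Merryn Kolker",
                              .attrs = c(referenceNumber = "15351")),
    AdditionalEmailStr = NULL,
    AdditionalComment  = "Nicely Performed",
    NominationParcel   = list(.attrs = c(referenceNumber = "107")))  # illustrative
)

# Flatten each record into a named character vector.
rows <- lapply(records, unlist)

# Every field name that occurs in at least one record.
all_cols <- unique(unlist(lapply(rows, names), use.names = FALSE))

# Index each row by name; a name the row lacks yields NA instead of
# letting a later value slide into the gap.
tab <- do.call(rbind, lapply(rows, function(r) {
  out <- r[all_cols]
  names(out) <- all_cols
  out
}))
rownames(tab) <- NULL
df <- as.data.frame(tab, stringsAsFactors = FALSE)
print(df)
```

A field that is NULL in every record (AdditionalEmailStr here) produces no column, because unlist() drops it; such names would have to be added to all_cols by hand. The resulting data frame could then be written with write.csv(df, "output.csv", row.names = FALSE), where the file name is again only an example.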
Regards,
Sowmiyan