[BioC] ASN.1
Ben Tupper
btupper at bigelow.org
Mon Jan 21 01:59:19 CET 2013
Hello again,
> -----Original Message-----
> From: bioconductor-bounces at r-project.org [mailto:bioconductor-bounces at r-project.org] On Behalf Of Ben Tupper
> Sent: Wednesday, January 09, 2013 10:43 AM
> To: Bioconductor mailing list
> Subject: [BioC] ASN.1
>
> Hello,
>
> We make extensive use of NCBI's BLAST application [1] in our workflows. One of the optional output formats of the application is XML formatted data. That output works very well for most of our purposes as this form of output is complete. However, we have encountered an issue where, for very large inputs, output to XML becomes very resource heavy for our server, bringing our workflow to a crawl. We are trying end-runs around the issue, including using other output format options (flat ASCII tables, HTML, etc.) and also saving the output to NCBI's ASN.1 archive format [2] and then converting it using the blast_formatter application [3], but none fit the bill.
>
> NCBI makes its AsnLib tool kit available, but we don't have the resources at this time to dive into C and C++. We are wondering if there are any resources available in R for reading NCBI's ASN.1 archive format. Do such beasts exist?
>
Thanks to a number of off-list communications, I realized that this approach was not going to be fruitful. My boss suggested that we try splitting input FASTA files into smaller pieces and then outputting smaller XML files. That works well (and that's why he's the boss!). These XML files are easier to deal with. One little surprise is that *sometimes* blast outputs multi-document XML files (more than one xmlRoot in a single file). The XML package doesn't appear to be able to work with those. Instead, we read those files as if they were plain text, find the start and end lines of each document, and feed each block of lines to the XML tree parser.
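A minimal sketch of the FASTA splitting step, assuming plain-text FASTA input in which every record starts with a '>' header line. The function name splitFasta and its arguments 'n' and 'prefix' are hypothetical and only illustrate the idea; any lines before the first header are dropped.

```r
# Hypothetical helper (not part of blast or of any package): split a FASTA
# file into pieces holding at most 'n' sequences each.
# Returns the names of the files written.
splitFasta <- function(file, n = 1000, prefix = "chunk") {
    s <- readLines(file)
    # each record begins with a header line starting with '>'
    starts <- grep("^>", s)
    if (length(starts) == 0) stop("no FASTA headers found in ", file)
    # a record ends on the line before the next header (or at end of file)
    ends <- c(starts[-1] - 1, length(s))
    # assign records to chunks: records 1..n -> 1, (n+1)..2n -> 2, ...
    chunk <- ceiling(seq_along(starts) / n)
    out <- character(max(chunk))
    for (k in unique(chunk)) {
        rec <- which(chunk == k)
        out[k] <- sprintf("%s_%03d.fasta", prefix, k)
        writeLines(s[starts[rec[1]]:ends[rec[length(rec)]]], out[k])
    }
    out
}
```

Each file written this way can then be passed to blast separately, which yields one smaller XML output per piece.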
The XML file(s) have all the things we want for our home-made flat tables and HTML files. Below is the function we have used to good effect so far. It also works with the multi-document XML files.
Cheers,
Ben
library(XML)

# Read one or more blast XML files, including multi-document
# (container-style) XML files.
#
# file - character vector of one or more XML filenames, possibly compressed
# useInternalNodes - (default is TRUE) see xmlTreeParse
# asList - (default is FALSE) if TRUE return a list of XML root nodes even
#    if there is only one. Ignored if the length of the 'file' input is
#    greater than 1 or if one or more files are multi-document.
read.blastXML <- function(file, useInternalNodes = TRUE, asList = FALSE){

    # we have more than one file - a recursion is required
    if (length(file) > 1){
        x <- lapply(file, read.blastXML,
            useInternalNodes = useInternalNodes, asList = TRUE)
        # flatten one level only, because any one of the input files may be
        # a multi-document XML file; a fully recursive unlist() would also
        # take apart list-based nodes when useInternalNodes = FALSE
        return(unlist(x, recursive = FALSE))
    }

    # is it a simple one-document file?
    x <- try(xmlRoot(xmlTreeParse(file, useInternalNodes = useInternalNodes)),
        silent = TRUE)

    # if not, then it could be a multi-document XML file. In that case we
    # scan the file into a character vector, find the <?xml ...> lines and
    # then parse the XML in slabs of text
    if (inherits(x, "try-error")) {
        cat("read.blastXML: unable to read xml file... trying as multi-document xml\n")
        # scan in as text, one element per line
        ff <- gzfile(file)
        s <- scan(ff, what = character(), sep = "\n", quiet = TRUE)
        close(ff)
        # the start points of each document, plus one past the last line
        ix <- c(grep("^<.xml", s), length(s) + 1)
        n <- length(ix) - 1
        # results list
        x <- vector(mode = "list", length = n)
        for (i in seq_len(n)) {
            # note that we may end up with a 'try-error' in any element;
            # if so the end user will have to figure out what to do next
            x[[i]] <- try(xmlRoot(xmlTreeParse(s[ ix[i]:(ix[i+1] - 1) ],
                asText = TRUE,
                useInternalNodes = useInternalNodes)))
        }
    } else {
        if (asList) x <- list(x)
    }
    return(x)
}
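
To illustrate how the start/stop indexing in the function above carves a multi-document file into slabs, here is a small base-R sketch on made-up input. The toy vector 's' stands in for the scanned lines of a file and is not real blast output.

```r
# two tiny XML documents concatenated, mimicking a multi-document file
s <- c('<?xml version="1.0"?>', '<BlastOutput>', '</BlastOutput>',
       '<?xml version="1.0"?>', '<BlastOutput>', '</BlastOutput>')

# line numbers where each document starts, plus one past the last line
ix <- c(grep("^<.xml", s), length(s) + 1)

# each slab runs from one start line to the line before the next start
slabs <- lapply(seq_len(length(ix) - 1),
    function(i) s[ix[i]:(ix[i + 1] - 1)])
```

Each element of 'slabs' holds the lines of one complete document, which is the text the function hands to xmlTreeParse with asText = TRUE.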
> Thanks,
> Ben
>
> [1] http://www.ncbi.nlm.nih.gov/books/NBK1763/
> [2] http://www.ncbi.nlm.nih.gov/IEB/ToolBox/SDKDOCS/ASNLIB.HTML
> [3] http://home.cc.umanitoba.ca/~psgendb/birchhomedir/doc/NCBI/blast_formatter.txt
>
Ben Tupper
Bigelow Laboratory for Ocean Sciences
180 McKown Point Rd. P.O. Box 475
West Boothbay Harbor, Maine 04575-0475
http://www.bigelow.org