[BioC] Reading Paired End Native Report Format in ShortRead
Martin Morgan
mtmorgan at fhcrc.org
Mon Jul 16 00:31:07 CEST 2012
On 07/13/2012 11:17 AM, Murli wrote:
> Hi Martin,
> Thanks for your suggestions. I have tried for some time to parse the
> novoAlign output using read.table, with no success yet. I have just
> started using shortRead and have not worked with GenomicRanges yet. I
> have created two files of the paired end data containing ~500 lines
> of data that may be downloaded from
> http://bioinformatics.iusb.edu/seqSubset.tar . Would it be possible
> for your take a look at the format and guide me a little here? I
> would greatly appreciate it.
> Cheers../Murli
Really I think the most straight-forward approach is to get novoAlign to
generate SAM or BAM output; if SAM then use Rsamtools::asBam to convert
to BAM. You can then read the data in with Rsamtools::scanBam,
GenomicRanges::readGappedAlignments, and in devel
GenomicRanges::readGappedAlignmentPairs. Otherwise you'll spend a lot of
time working through the appropriate interpretation of the novoAlign fields.
I'm attaching a script that might at least provide some inspiration if
you really want to parse the current text format; there are some
significant problems (see the FIXME's, for example) and some aspects of
your files look weird, e.g., the second column contains 'S' indicating
single-end alignments, but the intention seems to have been to align
paired end data?
Martin
>
>
>
> On Thu, Jul 12, 2012 at 2:58 PM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
>> On 07/12/2012 11:42 AM, Murli Nair [guest] wrote:
>>>
>>>
>>> Hi,
>>>
>>> I am trying to read the alignments generated using NovoAlign. The format I
>>> have the data is Paired End Native Report
>>> Format(http://computing.bio.cam.ac.uk/local/doc/NovoCraftV2.06.pdf).
>>> What is the most efficient way to read this data into ShortRead? Since it
>>> is paired end data I have two files corresponding to the two sides.
>>> I tried without success using the different formats using readAligned(). I
>>> also read an earlier posting about it which suggests to convert it to SAM
>>> format.
>>> I would appreciate your suggestions.
>>
>>
>> From the document you reference
>>
>> Three output formats are provided.
>>
>> 1. Native
>>
>> 2. Extended Native
>>
>> 3. Pairwise
>>
>> 4. SAM
>>
>> If Paired End Native Report Format is 1 or 2 with a single record per line
>> then I believe the only support for input would be as tab-delimited files
>> (read.table and friends; these are flexible and could easily be used to
>> iterate through a large file in a memory efficient way); you would then use
>> an appropriate constructor, e.g., GenomicRanges::GappedAlignmentPairs, to
>> create an object that you could manipulate. Format 3 looks challenging to
>> parse.
>>
>> Generally, for aligned reads aim for BAM files, which is output format 4
>> followed by using Rsamtools or other with asBam, sortBam, indexBam to create
>> a sorted bam file and index. use GenomicRanges::readGappedAlignmentPairs for
>> many paired-end tasks.
>>
>> It might help to think a little further ahead about what you want to do,
>> e.g., GenomicRanges::summarizeOverlaps would be useful in RNAseq
>> differential expression to count reads in regions of interest, and would
>> need bam files but would manage data input for you.
>>
>> Martin
>>
>>> Cheers../Murli
>>>
>>>
>>> -- output of sessionInfo():
>>>
>>> R version 2.15.0 (2012-03-30)
>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>
>>> locale:
>>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
>>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
>>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
>>> [7] LC_PAPER=C LC_NAME=C
>>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>
>>> attached base packages:
>>> [1] stats graphics grDevices utils datasets methods base
>>>
>>> other attached packages:
>>> [1] ShortRead_1.14.4 latticeExtra_0.6-19 RColorBrewer_1.0-5
>>> [4] Rsamtools_1.8.5 lattice_0.20-6 Biostrings_2.24.1
>>> [7] GenomicRanges_1.8.7 IRanges_1.14.4 BiocGenerics_0.2.0
>>>
>>> loaded via a namespace (and not attached):
>>> [1] Biobase_2.16.0 bitops_1.0-4.1 grid_2.15.0 hwriter_1.3
>>> stats4_2.15.0
>>> [6] tools_2.15.0 zlibbioc_1.2.0
>>>
>>> --
>>> Sent via the guest posting facility at bioconductor.org.
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>
>>
>> --
>> Computational Biology / Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N.
>> PO Box 19024 Seattle, WA 98109
>>
>> Location: Arnold Building M1 B861
>> Phone: (206) 667-2793
>>
>>
--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793
-------------- next part --------------
library(GenomicRanges)
.novoAlign_input <-
function(fl, filt, gal, ...)
{
## input
col.names <- c("qname", "type", "qseq", "qual", "status", "maps",
"mapq", "rname", "pos", "strand", "mrnm", "mpos",
"mstrand", "mismatches")
df <- read.table(fl, sep="\t", fill=TRUE, col.names=col.names,
stringsAsFactors=FALSE, ...)
## filter and coerce to GappedAlignments object
gal(filt(df))
}
.novoAlign_filter <-
function(df, ...)
{
## keep only uniquely aligning pairs on the forward or reverse strand
keep <- (df$status == "U") & (df$strand %in% c("F", "R"))
df[keep,]
}
.novoAlign_GappedAlignments <-
function(df, ...)
{
## coerce some information to GappedAlignments object
with(df, {
GappedAlignments(seqnames=rname, pos=pos,
## FIXME: not correct for indels
cigar=paste0(nchar(qseq), "M"),
strand=strand(c(F="+", R="-")[strand]),
names=qname,
qseq=DNAStringSet(qseq),
qual=PhredQuality(qual))
})
}
.novoAlign_GappedAlignmentPairs <-
function(first, last, ...)
{
## create GappedAlignmentPairs
nms <- union(as.character(seqnames(first)),
as.character(seqnames(last)))
seqlevels(first) <- seqlevels(last) <- nms
isProperPair <- rep(TRUE, length(first))
GappedAlignmentPairs(first, last, isProperPair)
}
fls <- dir(pattern="novoOutput$")
pairs <- lapply(fls, .novoAlign_input, .novoAlign_filter,
.novoAlign_GappedAlignments)
## FIXME: is this matching correct?
o <- match(names(pairs[[1]]), names(pairs[[2]]))
galp <- .novoAlign_GappedAlignmentPairs(pairs[[1]][!is.na(o)],
pairs[[2]][o[!is.na(o)]])
More information about the Bioconductor
mailing list