[BioC] Reading Paired End Native Report Format in ShortRead

Martin Morgan mtmorgan at fhcrc.org
Thu Jul 12 20:58:01 CEST 2012


On 07/12/2012 11:42 AM, Murli Nair [guest] wrote:
>
> Hi,
>
> I am trying to read alignments generated using NovoAlign. The data are in Paired End Native Report Format (http://computing.bio.cam.ac.uk/local/doc/NovoCraftV2.06.pdf).
> What is the most efficient way to read this data into ShortRead? Since it is paired-end data, I have two files corresponding to the two ends.
> I tried the different formats available in readAligned() without success. I also read an earlier posting that suggests converting the data to SAM format.
> I would appreciate your suggestions.

From the document you reference:

Three output formats are provided.

1. Native

2. Extended Native

3. Pairwise

4. SAM

If Paired End Native Report Format is format 1 or 2, with a single
record per line, then I believe the only support for input is as a
tab-delimited file (read.table and friends; these are flexible and can
easily be used to iterate through a large file in chunks, which keeps
memory use low). You would then use an appropriate constructor, e.g.,
GenomicRanges::GappedAlignmentPairs, to create an object that you can
manipulate. Format 3 looks challenging to parse.
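
A minimal sketch of the chunked read.table approach, assuming a
tab-delimited file with one record per line and header lines that start
with '#'. The helper name process_in_chunks and the three toy columns
are made up for illustration; they are not the actual NovoAlign column
layout, which you would need to take from the NovoAlign manual.

```r
# Hypothetical helper (not part of ShortRead): apply FUN to successive
# chunks of a tab-delimited file without loading the whole file at once.
process_in_chunks <- function(path, chunk_size, FUN) {
    con <- file(path, open = "r")
    on.exit(close(con))
    results <- list()
    repeat {
        lines <- readLines(con, n = chunk_size)
        if (length(lines) == 0L) break          # end of file reached
        lines <- lines[!grepl("^#", lines)]     # assumption: '#' marks header lines
        if (length(lines) == 0L) next           # chunk held only header lines
        chunk <- read.table(text = lines, sep = "\t", header = FALSE,
                            quote = "", comment.char = "",
                            stringsAsFactors = FALSE)
        results[[length(results) + 1L]] <- FUN(chunk)
    }
    results
}

# Toy input: one header line plus five records with made-up columns
toy <- tempfile(fileext = ".txt")
records <- paste(paste0("read", 1:5), "chr1", 100 * (1:5), sep = "\t")
writeLines(c("# toy header", records), toy)

# Count the records seen in each chunk of (at most) two lines
rows_per_chunk <- process_in_chunks(toy, chunk_size = 2, FUN = nrow)
```

In practice FUN would select the columns needed (read name, sequence
name, position, strand) and return a compact summary, so that only one
chunk of the full table is in memory at a time.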

Generally, for aligned reads aim for BAM files: write output format 4
(SAM), then use Rsamtools (asBam, sortBam, indexBam) or another tool to
create a sorted BAM file and its index. Use
GenomicRanges::readGappedAlignmentPairs for many paired-end tasks.
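
The SAM-to-BAM steps might look roughly like this. It is only a sketch:
it uses the small ex1.sam example file that ships with Rsamtools as a
stand-in for NovoAlign SAM output, and temporary files as destinations;
substitute your own paths.

```r
library(Rsamtools)

# Stand-in input: example SAM file shipped with Rsamtools
sam <- system.file("extdata", "ex1.sam", package = "Rsamtools",
                   mustWork = TRUE)

# Convert SAM to BAM; asBam() appends ".bam" to 'destination' and
# returns the path of the file it wrote. Indexing is deferred here so
# that the sort and index steps are shown explicitly.
bam <- asBam(sam, destination = tempfile(), indexDestination = FALSE)

# Sort by genomic position (sortBam() also appends ".bam"), then index;
# the index is written next to the sorted file with a ".bai" suffix.
sorted <- sortBam(bam, destination = tempfile())
indexBam(sorted)

# Quick sanity check: number of records and nucleotides in the file
counts <- countBam(sorted)
```

The sorted, indexed file is then what you would hand to the paired-end
reader mentioned above.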

It might help to think a little further ahead about what you want to do,
e.g., GenomicRanges::summarizeOverlaps is useful in RNA-seq
differential expression for counting reads in regions of interest; it
needs BAM files but manages data input for you.

Martin

> Cheers../Murli
>
>
>   -- output of sessionInfo():
>
> R version 2.15.0 (2012-03-30)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
>   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>   [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>   [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>   [7] LC_PAPER=C                 LC_NAME=C
>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] ShortRead_1.14.4    latticeExtra_0.6-19 RColorBrewer_1.0-5
> [4] Rsamtools_1.8.5     lattice_0.20-6      Biostrings_2.24.1
> [7] GenomicRanges_1.8.7 IRanges_1.14.4      BiocGenerics_0.2.0
>
> loaded via a namespace (and not attached):
> [1] Biobase_2.16.0 bitops_1.0-4.1 grid_2.15.0    hwriter_1.3    stats4_2.15.0
> [6] tools_2.15.0   zlibbioc_1.2.0
>
> --
> Sent via the guest posting facility at bioconductor.org.


-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793


