[BioC] R for normalizing gene length for next-gene sequencing data

Sat Apr 23 03:18:51 CEST 2011

Hi,

On Fri, Apr 22, 2011 at 7:49 PM, Andrew Wang
<andrew.wang.2010.2011 at gmail.com> wrote:
> Hello, everyone
>
> I am wondering how to use R packages to generate a count table with
> samples as columns and tags as rows. In addition, how to normalize
> the counts to the length of each gene. That is, all gene counts
> should be normalized from 0 to 1 in gene length and then draw a
> distribution of counts. Finally, how to access these R objects that
> store these data and to manipulate them using R commands/scripts. Thanks.

You will want to get very comfortable with the following packages:

* IRanges and GenomicRanges

Use the data structures in these packages (IRanges or GRanges) to
store and manipulate your reads.

* GenomicFeatures

Provides functionality to access gene/transcript info from different
annotation sources (RefSeq, UCSC, etc.) and exposes it as GRanges
objects. This makes it easy to quantify which reads overlap which
genes/exons/etc., assuming you are storing your reads in GRanges
objects (prefer GRanges over plain IRanges, since GRanges also keeps
track of chromosome and strand).

* Maybe Rsamtools to query your BAM files and load them into
appropriate data structures
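
As a rough sketch of that last step (not the only way to do it), the
snippet below reads alignments with Rsamtools and wraps them in a
GRanges object. It uses the small example file ex1.bam that ships with
Rsamtools; substitute your own BAM path. The variable names (aln, ok,
reads) are only illustrative, and using the read length (qwidth) as
the width on the reference is a simplifying assumption that ignores
gaps and clipping in the alignment.

```r
library(Rsamtools)
library(GenomicRanges)

# Example BAM bundled with Rsamtools; replace with your own file
bamfile <- system.file("extdata", "ex1.bam", package = "Rsamtools")

# Pull only the fields needed to place each read on the genome
param <- ScanBamParam(what = c("rname", "strand", "pos", "qwidth"))
aln <- scanBam(bamfile, param = param)[[1]]

# Unmapped reads have no position; drop them
ok <- !is.na(aln$pos)

# Assumption: read length approximates the aligned width
reads <- GRanges(
  seqnames = aln$rname[ok],
  ranges   = IRanges(start = aln$pos[ok], width = aln$qwidth[ok]),
  strand   = aln$strand[ok]
)

reads
```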

Read through the vignettes in these packages.

You will be able to do all the things you are asking for once you get
comfortable with the packages above.
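
To give a flavor of what that looks like, here is a minimal sketch of
a count table (genes as rows, samples as columns) scaled by gene
length. Everything in it is made up for illustration: the two "genes",
the reads in sample1 and sample2, and the choice of reads per kilobase
as the length normalization. With real data the gene ranges would come
from GenomicFeatures and the reads from your BAM files.

```r
library(GenomicRanges)

# Two hypothetical genes on chr1 (500 bp and 200 bp long)
genes <- GRanges("chr1", IRanges(start = c(1, 801), end = c(500, 1000)))
names(genes) <- c("geneA", "geneB")

# Made-up 36 bp reads for two samples
sample1 <- GRanges("chr1", IRanges(start = c(100, 150, 900), width = 36))
sample2 <- GRanges("chr1", IRanges(start = c(10, 850, 950, 960), width = 36))

# Count table: one row per gene, one column per sample
counts <- cbind(
  sample1 = countOverlaps(genes, sample1),
  sample2 = countOverlaps(genes, sample2)
)
rownames(counts) <- names(genes)

# Divide each row by its gene length in kb (reads per kilobase).
# The length vector is recycled down the columns, so row i is
# divided by the length of gene i.
per_kb <- counts / (width(genes) / 1000)

counts
per_kb
```

Both counts and per_kb are ordinary R matrices, so the usual tools
(indexing, apply(), hist(), etc.) work on them directly.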

Also

* The Biostrings and BSgenome.* packages will be your friends.

Read through this stuff, too:

http://www.bioconductor.org/help/workflows/high-throughput-sequencing/

Tutorial/course material here:

http://www.bioconductor.org/help/course-materials/2010/EMBL2010/

-steve

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact