[BioC] Estimating Size Factors in DESeq Package

Fri Sep 7 17:05:34 CEST 2012

Good evening Dr. Simon Anders,

I'm a master's student at Universidad de los Andes (Bogotá, Colombia), currently doing research on differential gene expression between three conditions (2 treated and 1 untreated). I'm using DESeq in R to compare the FPKMs of the genes under these three conditions with one replicate per condition.

The data come from single-end RNA-seq, and the reads were aligned to a reference genome using Bowtie. Cufflinks and, subsequently, CuffCompare were used to obtain the reads mapped in a given sample to a given gene.

Overall, I organized my tables much like the pasilla_gene_counts.tsv file you used in your 2012 guide "Analyzing RNA-seq Data with the DESeq Package". At first I had two tables, one per replicate, each containing FPKMs for the three conditions analyzed. I then merged the FPKMs from both files to obtain three tables, each holding the FPKMs of both replicates for one pairwise comparison (TreatedA vs Untreated; TreatedB vs Untreated; TreatedA vs TreatedB). The final format for each of these tables was as follows:

Gene      Treated1 (Replicate 1)   Treated2 (Replicate 2)   Untreated1 (Replicate 1)   Untreated2 (Replicate 2)
tag_id    FPKM                     FPKM                     FPKM                       FPKM
tag_id    FPKM                     FPKM                     FPKM                       FPKM
...

I understand that DESeq must first normalize the expression values of each treatment by dividing each column by its own size factor. However, when I try to estimate the size factors for any of my three final tables, I get an "NA" (not available) value for each treatment. This only happens with the tables that include both replicates, not with the two previous tables that each contain information for one replicate (results are attached).
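To make the normalization step described above concrete, here is a minimal Python sketch of a median-of-ratios size factor estimate. It is not DESeq's actual R code. The function name `estimate_size_factors`, the toy table, and the choice to skip rows containing a zero are assumptions made for illustration. Under those assumptions, a table in which every gene has a zero in at least one column leaves no usable rows, which is one way an undefined (NA-like) size factor could arise.

```python
import math
from statistics import median

def estimate_size_factors(table):
    """Median-of-ratios sketch (hypothetical helper, not the DESeq API).

    table: list of rows (one per gene), each a list of values (one per sample).
    Returns one size factor per sample (column).
    """
    n_samples = len(table[0])

    # Log of the geometric mean of each gene across samples.
    # Assumption: a row with any zero has geometric mean 0 and is skipped.
    log_geomeans = []
    for row in table:
        if all(v > 0 for v in row):
            log_geomeans.append(sum(math.log(v) for v in row) / len(row))
        else:
            log_geomeans.append(float("-inf"))

    factors = []
    for j in range(n_samples):
        # Log ratio of each usable gene's value to its geometric mean.
        log_ratios = [
            math.log(row[j]) - lg
            for row, lg in zip(table, log_geomeans)
            if math.isfinite(lg)
        ]
        # With no usable genes the median is undefined, so return NaN.
        factors.append(math.exp(median(log_ratios)) if log_ratios else float("nan"))
    return factors

# Toy data: the second sample is sequenced exactly twice as deeply as the first.
toy = [[10, 20], [100, 200], [5, 10]]
print(estimate_size_factors(toy))        # approximately [0.71, 1.41]

# Every row has a zero somewhere, so no gene is usable.
all_zero_rows = [[0, 5], [3, 0]]
print(estimate_size_factors(all_zero_rows))
```

Dividing each column of the toy table by its factor puts both samples on a common scale. The second call returns NaN for both columns because no row survives the zero filter.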

We don't know what might be causing this problem, because the tables that contain information for one replicate have the same format as the tables that have both replicates (and in the example you use a table that contains replicates).

Given this problem, I was planning to run two separate analyses using DESeq without any replicates (Section 3.3 of your guide), but I read that this requires assuming that gene expression levels are quite similar between treatments, which is not our case. The idea I had in mind was to perform one analysis per replicate, separately, and then compare the results to keep only the genes that show differential expression in both analyses. Is this right, even though we know the expression levels vary between conditions?

What could be the cause of the "NA" output when trying to estimate the size factors for the tables that contain both replicates?

Can I analyze three treatments at once (all in one table) using DESeq, or only two conditions per table (analysis)?

We would greatly appreciate your help on this matter, because we want to continue using DESeq for our differential gene expression analysis.

Sincerely,

Andrés E. Rodríguez C.
Graduate Assistant
LAMFU - Universidad de los Andes
(Bogotá D.C., Colombia)