[BioC] Can the normalization factors be too far apart in EdgeR analyses?

Sat Jul 21 17:14:47 CEST 2012

Hi Jason,

Some comments below.

On 20.07.2012, at 23:16, Hoskins, Jason (NIH/NCI) [F] wrote:

> Hello,
> 
> I have RNA-seq data from 10 normal samples and 8 tumor samples, and I am using edgeR to test for differential expression (DE) between the tumors and the normals.  I have basically followed the workflow in section 3.3 of the edgeR user's guide.  These normal tissue samples are known to have a large RNA compositional bias (i.e. the top 25 genes by raw counts account for 50-80% of the total reads) that is not present in the tumor samples, so normalization via edgeR's calcNormFactors() is presumably very important.  The results from calcNormFactors() are printed below with anonymized sample names.
> 
>            group  lib.size norm.factors
> Sample1  Normals 136765371    1.0567240
> Sample2  Normals 116803340    0.5898912
> Sample3  Normals  88783007    0.5880073
> Sample4  Normals 314426955    0.6871909
> Sample5  Normals 289961788    0.5574136
> Sample6  Normals 296455983    0.3413478
> Sample7  Normals 260923863    0.7353922
> Sample8  Normals 118870482    0.7742314
> Sample9  Normals 237556345    0.5113664
> Sample10 Normals 126493394    0.3916818
> Sample11  Tumors  90611059    1.7934781
> Sample12  Tumors  93423641    2.0290747
> Sample13  Tumors 122360083    1.9691099
> Sample14  Tumors  80575136    1.9405350
> Sample15  Tumors 104183711    1.7019891
> Sample16  Tumors 112372313    2.0484955
> Sample17  Tumors 102789103    1.8569770
> Sample18  Tumors  96733614    2.0323221
> 
> My first question is what is used as the reference in the default TMM method's calculation of the normalization factors?  The user's guide and other documentation claim that the reference is "the sample whose 75%-ile (of library-scale-scaled counts) is closest to the mean of 75%-iles."  Presumably the normalization factor for the reference sample should be 1.0, but none of my samples has a normalization factor of 1.0 (the closest is Sample1 with 1.0567240).

Read a bit further in the docs and it says:

"For symmetry, normalization factors are adjusted to multiply to 1.
 The effective library size is then the original library size
 multiplied by the scaling factor."

That's why there is no sample with factor=1: after this rescaling, even the reference sample's factor is generally no longer exactly 1.
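
To make the quoted passage concrete, here is a small arithmetic check in Python (not edgeR code) on the numbers from the table above.  It multiplies the 18 reported factors together and computes the effective library sizes.  The helper rescale_to_unit_product() is a hypothetical name used only for illustration; it shows one way to make a set of factors multiply to 1 (dividing by their geometric mean) and is not a copy of edgeR's internals.

```python
import math

# Normalization factors and library sizes copied from the table in the question.
norm_factors = [
    1.0567240, 0.5898912, 0.5880073, 0.6871909, 0.5574136, 0.3413478,
    0.7353922, 0.7742314, 0.5113664, 0.3916818,            # Normals
    1.7934781, 2.0290747, 1.9691099, 1.9405350, 1.7019891,
    2.0484955, 1.8569770, 2.0323221,                       # Tumors
]
lib_sizes = [
    136765371, 116803340, 88783007, 314426955, 289961788, 296455983,
    260923863, 118870482, 237556345, 126493394,
    90611059, 93423641, 122360083, 80575136, 104183711,
    112372313, 102789103, 96733614,
]

# The reported factors should multiply to 1, up to rounding in the printout.
product = math.prod(norm_factors)

# Effective library size = original library size * normalization factor.
effective = [n * f for n, f in zip(lib_sizes, norm_factors)]


def rescale_to_unit_product(factors):
    """Hypothetical helper: divide by the geometric mean so the product is 1."""
    geo_mean = math.exp(sum(math.log(f) for f in factors) / len(factors))
    return [f / geo_mean for f in factors]


if __name__ == "__main__":
    print(f"product of factors: {product:.4f}")
    for i, eff in enumerate(effective, start=1):
        print(f"Sample{i}: effective library size {eff:,.0f}")
```

Because every factor is divided by the same constant, the ratios between samples are unchanged; only the sample that started at 1 (the reference) moves away from 1.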

> My second question is should I be concerned about the large variation in normalization factors among the normals group, and the even larger difference in normalization factors between the normals and the tumors?  I guess it's not all that surprising that the normalization factors are very different between normals and tumors given the huge compositional bias in the normal samples, but is the TMM method robust enough to handle these differences?

It's tough to know whether to be concerned based on these numbers alone.  I suggest having a look at some pairwise MA-plots: within normals, within tumors, and between the two groups.  Sample6 versus Sample16, for example, is the most extreme pair (normalization factors 0.34 and 2.05, about a six-fold difference).  I will say that these factors are amongst the most extreme that I've seen, but it really depends on the data.
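
For readers unsure what goes into a pairwise MA-plot, here is a minimal Python sketch of the underlying arithmetic (again not edgeR code; in practice you would use the plotting functions in R).  For each gene, M is the log2 ratio of the library-size-scaled counts in the two samples and A is their average log2 abundance.  The function name ma_values() and the toy counts are made up for illustration; genes with a zero count in either sample are simply skipped here, which is an assumption of this sketch.

```python
import math


def ma_values(counts_a, counts_b, lib_a, lib_b):
    """Return (M, A) lists for genes with non-zero counts in both samples.

    M = log2((a / lib_a) / (b / lib_b))        # log fold change
    A = 0.5 * log2((a / lib_a) * (b / lib_b))  # average log abundance
    """
    m_vals, a_vals = [], []
    for a, b in zip(counts_a, counts_b):
        if a == 0 or b == 0:
            continue  # log of zero is undefined; skipped in this sketch
        pa, pb = a / lib_a, b / lib_b
        m_vals.append(math.log2(pa / pb))
        a_vals.append(0.5 * math.log2(pa * pb))
    return m_vals, a_vals


if __name__ == "__main__":
    # Hypothetical toy counts for two samples (not from the poster's data).
    sample_x = [10, 20, 0, 400]
    sample_y = [20, 20, 5, 100]
    m, a = ma_values(sample_x, sample_y, lib_a=1000, lib_b=2000)
    for mi, ai in zip(m, a):
        print(f"M = {mi:+.3f}, A = {ai:.3f}")
```

Plotting M against A for a pair such as Sample6 and Sample16 shows whether the bulk of the genes sits in a tight cloud shifted away from M = 0.  If TMM is behaving sensibly, one would expect that shift to be roughly log2 of the ratio of the two factors (about 2.6 in absolute value for this pair); that is something to check, not a guarantee.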

>  Is TMM the best method for this type of normalization?

Questions about which method is "best" are not easy to answer, and the answer is often dataset-dependent.  TMM is good at what it does: removing a systematic bias between samples.  It doesn't account for everything (e.g. sample-specific GC content effects), so if your data exhibit these, consider looking at the BioC packages cqn and EDASeq.

Best,
Mark

> 
> Thanks for your help!
> 
> -Jason