[BioC] Can the normalization factors be too far apart in EdgeR analyses?
Mark Robinson
mark.robinson at imls.uzh.ch
Sat Jul 21 17:14:47 CEST 2012
Hi Jason,
Some comments below.
On 20.07.2012, at 23:16, Hoskins, Jason (NIH/NCI) [F] wrote:
> Hello,
>
> I have RNA-seq data from 10 normal samples and 8 tumor samples, which I am analyzing with edgeR for differential expression (DE) between the tumors and the normals. I have basically followed the workflow in section 3.3 of the edgeR user's guide. These normal tissue samples are known to have a large RNA compositional bias (i.e. the top 25 genes by raw counts account for 50-80% of the total reads) that is not present in the tumor samples, so normalization via edgeR's calcNormFactors() is presumably very important. The results from calcNormFactors() are printed below, with sample names anonymized.
>
>            group     lib.size  norm.factors
> Sample1    Normals  136765371     1.0567240
> Sample2    Normals  116803340     0.5898912
> Sample3    Normals   88783007     0.5880073
> Sample4    Normals  314426955     0.6871909
> Sample5    Normals  289961788     0.5574136
> Sample6    Normals  296455983     0.3413478
> Sample7    Normals  260923863     0.7353922
> Sample8    Normals  118870482     0.7742314
> Sample9    Normals  237556345     0.5113664
> Sample10   Normals  126493394     0.3916818
> Sample11   Tumors    90611059     1.7934781
> Sample12   Tumors    93423641     2.0290747
> Sample13   Tumors   122360083     1.9691099
> Sample14   Tumors    80575136     1.9405350
> Sample15   Tumors   104183711     1.7019891
> Sample16   Tumors   112372313     2.0484955
> Sample17   Tumors   102789103     1.8569770
> Sample18   Tumors    96733614     2.0323221
>
> My first question: what is used as the reference in the default TMM method's calculation of the normalization factors? The user's guide and other documentation state that the reference is "the sample whose 75%-ile (of library-scale-scaled counts) is closest to the mean of 75%-iles." Presumably the normalization factor for the reference sample should be 1.0, but none of my samples has a normalization factor of 1.0 (the closest is Sample1 with 1.0567240).
Read a bit further in the docs and it says:
"For symmetry, normalization factors are adjusted to multiply to 1.
The effective library size is then the original library size
multiplied by the scaling factor."
That's why no sample has a factor of exactly 1: after this rescaling, even the reference sample's factor is shifted away from 1.
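
As a quick check on the numbers posted above, here is a minimal Python sketch (plain arithmetic, not edgeR code; it needs Python 3.8+ for math.prod). It multiplies the 18 posted norm.factors together, which should come out very close to 1, and computes the effective library sizes for two of the samples. The variable names and the helper rescale_to_unit_product() are only illustrative, and the three "raw" factors are made up to show how dividing by the geometric mean moves a reference factor of 1 away from 1.

```python
import math

# norm.factors copied from the table in the original post (Sample1..Sample18)
norm_factors = [
    1.0567240, 0.5898912, 0.5880073, 0.6871909, 0.5574136, 0.3413478,
    0.7353922, 0.7742314, 0.5113664, 0.3916818,            # normals
    1.7934781, 2.0290747, 1.9691099, 1.9405350, 1.7019891,
    2.0484955, 1.8569770, 2.0323221,                        # tumors
]

# The posted factors should multiply to (almost exactly) 1.
product = math.prod(norm_factors)
print(f"product of posted factors: {product:.4f}")


def rescale_to_unit_product(raw):
    """Divide each factor by the geometric mean so the factors multiply to 1."""
    geo_mean = math.exp(sum(math.log(f) for f in raw) / len(raw))
    return [f / geo_mean for f in raw]


# Hypothetical raw factors; the first plays the role of the reference (exactly 1).
raw = [1.0, 0.5, 1.6]
rescaled = rescale_to_unit_product(raw)
print("rescaled factors:", [round(f, 4) for f in rescaled])

# Effective library size = original library size * normalization factor.
# Library sizes copied from the table in the original post.
lib_sizes = {"Sample6": 296455983, "Sample16": 112372313}
factors = {"Sample6": norm_factors[5], "Sample16": norm_factors[15]}
effective = {name: lib_sizes[name] * factors[name] for name in lib_sizes}
for name, size in effective.items():
    print(f"{name}: effective library size ~ {size / 1e6:.1f} million")
```

Note that Sample6 has the larger raw library but ends up with the smaller effective library size once its factor is applied.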
> My second question: should I be concerned about the large variation in normalization factors among the normals group, and the even larger difference in normalization factors between the normals and the tumors? I guess it's not all that surprising that the normalization factors are very different between normals and tumors given the huge compositional bias in the normal samples, but is the TMM method robust enough to handle these differences?
It's tough to know whether to be concerned based on these numbers alone. I suggest looking at some pairwise MA-plots: within normals, within tumors, and between the two groups. Sample6 versus Sample16, for example, is the most extreme pair (factors of 0.34 and 2.05). I will say that these factors are among the most extreme that I've seen, but it really depends on the data.
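
For reference, the M (log-ratio) and A (average log-abundance) values behind such a pairwise plot can be sketched as follows. This is Python with hypothetical toy counts, not the poster's data, and ma_values() is an illustrative helper rather than an edgeR function; in practice one would use edgeR's own plotting functions (e.g. maPlot or plotSmear). The toy "normal" sample has one dominant gene soaking up most of the reads, so the remaining genes all sit well below M = 0 even though their underlying counts are unchanged. The median of M is used here only as a crude stand-in for the trimmed, weighted mean that TMM computes.

```python
import math
import statistics


def ma_values(counts_a, counts_b, lib_a, lib_b):
    """Return a list of (M, A) pairs, one per gene with non-zero counts in both samples."""
    pairs = []
    for ya, yb in zip(counts_a, counts_b):
        if ya == 0 or yb == 0:
            continue  # log ratio is undefined when either count is zero
        pa = ya / lib_a  # proportion of sample A's reads
        pb = yb / lib_b  # proportion of sample B's reads
        m = math.log2(pa / pb)        # log fold-change of proportions
        a = 0.5 * math.log2(pa * pb)  # average log abundance
        pairs.append((m, a))
    return pairs


# Hypothetical toy counts. Gene 1 dominates the "normal" sample; genes 2-5
# have identical counts in both samples.
normal = [91000, 1000, 2000, 3000, 3000]
tumor = [1000, 1000, 2000, 3000, 3000]

pairs = ma_values(normal, tumor, sum(normal), sum(tumor))
m_values = [m for m, _ in pairs]
median_m = statistics.median(m_values)

for gene, (m, a) in enumerate(pairs, start=1):
    print(f"gene {gene}: M = {m:+.2f}, A = {a:.2f}")
print(f"median M = {median_m:+.2f}")
```

If the bulk of the points in a real MA-plot is centered near M = 0 after applying the normalization factors, that is some reassurance that the factors are doing their job for that pair.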
> Is TMM the best method for this type of normalization?
Questions about which method is "best" are not easy to answer, and the answer is often dataset-dependent. TMM is good at what it does: removing a systematic bias between samples. It doesn't account for everything (e.g. sample-specific GC content effects), so if your data exhibit these, consider looking at the BioC packages cqn and EDASeq.
Best,
Mark
>
> Thanks for your help!
>
> -Jason