[BioC] Filtering by variance, IQR, etc.

Wed Apr 11 03:24:30 CEST 2007

A little over a week ago I posed the question of what effect, if any, 
the method of filtering out genes based on low levels of variance (as 
measured by IQR) has on the calculation of the FDR. I appreciated 
Jenny's comments (below) and read over the posts by Robert Gentleman and 
others of Feb, 2006 (see Jenny's note far below).

Not being a theoritician, I decided to pursue some simulations of my own 
to better understand the process and have discovered, to my chagrin, 
that filtering by IQR does seem to cause a disturbance to the p value 
distribution and an under-reporting of the true FDR.

Immediately below is are the results I obtained with my simulations and 
below that is a script, with functions, which can be used to duplicate 
my results in R. I would ask that interested parties review my script 
and results carefully and comment on my methods and my conclusions that 
it is perhaps better not to do the IQR filtering, but to accept the fact 
that if we do not do the filtering, we just have to live with a higher 
FDR. I am actually hoping that someone will prove me wrong, but I do 
believe that I am correct.

I look forward to more healthy debate on this topic.

Mark

This output is tab-delimited and will look fine if pasted into Excel:

parameter	mean	std. dev		mean	std. dev		mean	std. dev		mean	std. dev
number of simulation runs	20.00	0.00		20.00	0.00		20.00	0.00		20.00	0.00
initial number of genes	30000.00	0.00		30000.00	0.00		30000.00	0.00	 
30000.00	0.00
initial percentage of significant genes	0.00	0.00		0.00	0.00	 
3.00	0.00		3.00	0.00
true number of significant genes	0.00	0.00		0.00	0.00		900.00	0.00	 
900.00	0.00
initial seeded fold change min.	1.00	0.00		1.00	0.00		1.10	0.00		1.10	0.00
initial seeded fold change max	1.00	0.00		1.00	0.00		2.25	0.00		2.25	0.00
number of arrays	12.00	0.00		12.00	0.00		12.00	0.00		12.00	0.00
initial mean expression	1000.04	0.20		1000.04	0.28		1010.06	0.19	 
1010.15	0.27
initial c.v.	0.13	0.00		0.13	0.00		0.13	0.00		0.13	0.00
initial actual fold change min.	1.00	0.00		1.00	0.00		1.00	0.00		1.00	0.00
initial actual fold change max.	1.39	0.03		1.40	0.03		2.68	0.09		2.66	0.08
IQR percentile of filter	0.00	0.00		25.00	0.00		0.00	0.00		25.00	0.00
post-fitler number of genes	30000.00	0.00		22501.00	0.00	 
30000.00	0.00		22501.00	0.00
post-filter mean expression	1000.04	0.20		998.47	0.31		1010.06	0.19	 
1011.91	0.27
post-filter c.v.	0.13	0.00		0.13	0.00		0.13	0.00		0.14	0.00
post-filter fold change min.	1.00	0.00		1.00	0.00		1.00	0.00		1.00	0.00
post-filter fold change max.	1.39	0.03		1.40	0.03		2.68	0.09		2.66	0.08
Storey q value FDR	0.20	0.00		0.20	0.00		0.20	0.00		0.20	0.00
sig. genes number	0.30	0.47		0.40	0.75		923.05	16.95		980.85	23.10
sig. genes mean expression	NA	NA		NA	NA		1309.78	5.56		1292.99	6.02
sig. genes c.v.	NA	NA		NA	NA		0.28	0.00		0.27	0.00
sig. genes fold change min.	NA	NA		NA	NA		1.10	0.01		1.12	0.01
sig. genes fold change max.	NA	NA		NA	NA		2.68	0.09		2.66	0.08
sig. genes true FDR	NA	NA		NA	NA		0.19	0.01		0.23	0.02
sig. genes true FNR	NA	NA		NA	NA		0.17	0.01		0.14	0.01

#Script to simulate the effects of IQR filtering on the FDR and positive 
gene characteristics
#  of a simulated dataset

##################################################################################################

#################################################################################################
# describe perform t-testing with FDR correction, make histogram of p 
values,
# determine true FDR, look at Fold Change (FC) characteristics of 
significant genes under
# 4 different conditions: 1. no induced FC, no filtering 2. no induced 
FC, IQR filtering,
# 3. induced FC, no IQR filtering, 4. induced FC, yes IQR filtering
#################################################################################################
#Input parameters
#num.sim.genes = 30000  #number of simulated genes on a array
#num.arrays = 12 #number of simulated arrays, must be even with n1=n2 
for this 2 group simulation
#mean.exprs = 1000 #mean simulated gene expression, will be normally 
distributed
#coeff.var= 0.13, #coefficient of variation within and between genes
#FC.range = c(1,1) #uniform range of fold change between groups
#percent.genes.real.change = 0 #percentage of genes that have a 
simulated fold change
#IQR.filt.per = 0 #percentage of genes to apply the IQR filter to, if 
any (zero means no genes filtered)
#FDR.cutoff = 0.20 #BH FDR cutoff level for significance
#num.runs = 20 #number of times the simulation is run to determine the 
std. dev. for each output value
#################################################################################################
#Output: a .pdf file of the p value distribution for the last simulation 
run and a tab-delim text file with .cvs file
	#extension (works with OpenOffice, you might want to change) that 
contains the following rows
#Initial State
#"number of simulation runs"  #number of total simulations run for a 
given set of input parameters
#"initial number of genes"  #how many simulated genes on each array
#"initial percentage of significant genes"  #percentage of genes with 
simulated spike-in fold changes
#"true number of significant genes"  #number of genes with simulated 
spike-in fold changes
#"initial seeded fold change min."  #the minimum spiked-in fold change
#"initial seeded fold change max"   #the maximum spiked-in fold change
#"number of arrays"  #number of simulated arrays
#"initial mean expression"   #initial mean expression level
#"initial c.v."   #initial mean within-gene coeffecient of variation
#"initial actual fold change min."  #initial actual minimum fold change
#"initial actual fold change max."   #initial actual maximum fold change
#Post-filtered state
#"IQR percentile of filter"   #percentage of the IQR distribution below 
which genes will be filtered out
#"post-fitler number of genes"   #number of remaining genes after IQR 
filtration
#"post-filter mean expression"   #mean expression of genes remaining 
after filtration
#"post-filter c.v."   #mean c.v. of genes remaining after filtration
#"post-filter fold change min."  #post-filtration actual minimum fold change
#"post-filter fold change max."   #post-filtration actual maximum fold 
change
#Significant Genes
#"BH value FDR"   #set FDR threshold
#"sig. genes number"   #total number of genes found significant
#"sig. genes mean expression"   #mean expression of significant genes
#"sig. genes c.v."   #mean with-gene c.v. of signficant genes
#"sig. genes fold change min."   #minimum fold change of significant genes
#"sig. genes fold change max."   #maximum fold change of significant genes
#"sig. genes true FDR"   # the true FDR
#"sig. genes true FNR"   #the true FNR

##############################################################
#LOAD THE FUNCTIONS BELOW BEFORE RUNNING THE FOLLOWING SIMULATIONS

IQR.sim.func(num.sim.genes = 30000, num.arrays = 12, mean.exprs = 1000, 
coeff.var= 0.13,
    FC.range = c(1,1), IQR.filt.per = 0, FDR.cutoff = 0.20, 
percent.genes.real.change = 0, num.runs = 20)

IQR.sim.func(num.sim.genes = 30000, num.arrays = 12, mean.exprs = 1000, 
coeff.var= 0.13,
    FC.range = c(1,1), IQR.filt.per = 25, FDR.cutoff = 0.20, 
percent.genes.real.change = 0, num.runs = 20)

IQR.sim.func(num.sim.genes = 30000, num.arrays = 12, mean.exprs = 1000, 
coeff.var= 0.13,
    FC.range = c(1.1, 2.25), IQR.filt.per = 0, FDR.cutoff = 0.20, 
percent.genes.real.change = 3, num.runs = 20)

IQR.sim.func(num.sim.genes = 30000, num.arrays = 12, mean.exprs = 1000, 
coeff.var= 0.13,
    FC.range = c(1.1, 2.25), IQR.filt.per = 25, FDR.cutoff = 0.20, 
percent.genes.real.change = 3, num.runs = 20)

##############################################################
##############################################################
##############################################################
###
###   END OF SIMULATION
###
##############################################################
##############################################################
##############################################################

#################################################################################################
# SIMULATION FUNCTIONS
#################################################################################################

IQR.sim.func <- function(num.sim.genes, num.arrays, mean.exprs, coeff.var,
    FC.range, IQR.filt.per, FDR.cutoff, percent.genes.real.change, num.runs)

   {

#a couple of utility functions to use later on

#fold change calculator
abs.fc.calc.func <- function(m1,m2){
   fc <- mean(m2)/mean(m1); if (fc < 1) {fc <- 1/fc} ; round(fc, digits 
= 2); fc}

#coefficient of variation calculator
cv.func <- function(vec){cv <- sd(vec)/mean(vec); cv}

for (run.num in 1:num.runs)

   {
##################################################################################################
#construct matrix of random normal numbers

cv <- coeff.var

mat <-abs(matrix(rnorm((num.sim.genes * num.arrays), mean = mean.exprs,
                    sd = (mean.exprs * cv)), nrow = num.sim.genes, ncol 
= num.arrays))

# "Seed" the matrix with genes with uniform dist. of fold changes of 
range min.change-max change

per.seed <- percent.genes.real.change #percentage of genes seeded

genes.seed <- (per.seed/100) * num.sim.genes #number of genes seeded

#marker for which genes are "true positives"(TRUE) and which are "true 
negatives"(FALSE)
genes.seed.logical <- c(rep(TRUE, genes.seed), rep(FALSE,(num.sim.genes 
- genes.seed)))

min.change <- FC.range[1]

max.change <- FC.range[2]

sim.FC <- seq(from = min.change, to = max.change, length.out = genes.seed)

if (genes.seed != 0)

   {

     for (i in 1:genes.seed) # Do seeding

       {
         mat[i,1:(num.arrays/2)] <- sim.FC[i] * mat[i, 1:(num.arrays/2)]}
   }

# Characteristics of all genes

mean.exprs.all <- mean(mat)

mean.cv.exprs.all <-mean(apply(mat, 1, cv.func))

fc.vec.all <- rep(NA, nrow(mat))

for (i in 1:nrow(mat)){fc.vec.all[i] <- 
abs.fc.calc.func(mat[i,1:(num.arrays/2)], mat[i,((num.arrays/2) + 
1):num.arrays])}

range.fc.all <- quantile(fc.vec.all, p = seq(0,1))

#IQR filtering

IQR.cutoff <- IQR.filt.per  #the percentile of IQR below which will be 
filtered out

IQR.vec <- apply(log2(mat), 1, IQR) #compute IQR for each row of mat 
("gene"), needs to be log2

IQR.per <- 100*rank(IQR.vec)/length(IQR.vec) #compute IQR percentile for 
each "gene"

IQR.filt <- IQR.per >= IQR.cutoff #T/F filter

mat <- mat[IQR.filt,]

genes.seed.logical <- genes.seed.logical[IQR.filt]

# Characteristics of post-filtered genes

mean.exprs.filtered <- mean(mat)

mean.cv.exprs.filtered <- mean(apply(mat, 1, cv.func))

fc.vec.filtered <- rep(NA, nrow(mat))

for (i in 1:nrow(mat)){fc.vec.filtered[i] <-
    abs.fc.calc.func(mat[i,1:(num.arrays/2)], mat[i,((num.arrays/2) + 
1):num.arrays])}

range.fc.filtered <- quantile(fc.vec.filtered, p = seq(0,1))

# Do t-testing

p.vec <- rep(NA, nrow(mat))

for (i in 1:length(p.vec))

   {

     p.vec[i] <- t.test(x = mat[i,1:(num.arrays/2)], y = 
mat[i,((num.arrays/2) + 1):num.arrays],
             alternative = "two.sided", var.equal = TRUE)$p.value

   }

hist(p.vec, breaks = 100, main=paste("Histogram of p values. 
IQR.filt.per = ", IQR.filt.per,
      ". FC.range is ", FC.range[1], " to ", FC.range[2], sep="") 
,xlab="p value")

require(geneplotter, quiet=T)

savepdf(fn = paste("hist.p.values.IQR.filt.per", IQR.filt.per,
      "FC.range.", FC.range[1], FC.range[2], sep="_"), dir = getwd())

# FDR correction using BH method

p.adj.vec <- p.adjust(p.vec, method = "BH")

sig.filt <- p.adj.vec <= FDR.cutoff

tot.sig <- sum(sig.filt) #total of sig. genes

tot.true.neg.sig <- sum(sig.filt & !genes.seed.logical) #true negatives 
in genes called sig.

true.FDR <- tot.true.neg.sig/tot.sig

tot.true.pos.not.sig <- sum(!sig.filt & genes.seed.logical) #true 
positives in genes called not sig.

true.FNR <- tot.true.pos.not.sig/sum(genes.seed.logical)

#characteristics of total positive genes

if (sum(sig.filt) > 1)

   {

     mat <- mat[sig.filt,]

     mean.exprs.sig <- mean(mat)

     mean.cv.exprs.sig <- mean(apply(mat, 1, cv.func))

     fc.vec.sig <- rep(NA, nrow(mat))

     for (i in 1:nrow(mat)){fc.vec.sig[i] <-
         abs.fc.calc.func(mat[i,1:(num.arrays/2)], mat[i,((num.arrays/2) 
+ 1):num.arrays])}

     range.fc.sig <- quantile(fc.vec.sig, p = seq(0,1))

   } else {

     mean.exprs.sig <- NA; mean.cv.exprs.sig <- NA; range.fc.sig 
<-rep(NA,2)}

#############################################################
############################################################
# Create output datastructure

if (run.num == 1)
   {
parameter <- c(
#Initial State
"number of simulation runs",
"initial number of genes",
"initial percentage of significant genes",
"true number of significant genes",
"initial seeded fold change min.",
"initial seeded fold change max",
"number of arrays",
"initial mean expression",
"initial c.v.",
"initial actual fold change min.",
"initial actual fold change max.",
#Post-filtered state
"IQR percentile of filter",
"post-fitler number of genes",
"post-filter mean expression",
"post-filter c.v.",
"post-filter fold change min.",
"post-filter fold change max.",
#Significant Genes
"BH FDR",
"sig. genes number",
"sig. genes mean expression",
"sig. genes c.v.",
"sig. genes fold change min.",
"sig. genes fold change max.",
"sig. genes true FDR",
"sig. genes true FNR")

out.mat <- matrix(ncol = 1, nrow = length(parameter))

}

#Create output of i and bind to out.mat
output.vec <-
   c(num.runs,
num.sim.genes,
per.seed,
(num.sim.genes*per.seed/100),
FC.range[1],
FC.range[2],
num.arrays,
mean.exprs.all,
mean.cv.exprs.all,
range.fc.all[1],
range.fc.all[2],

#Post-filtered state
IQR.filt.per,
sum(IQR.filt),
mean.exprs.filtered,
mean.cv.exprs.filtered,
range.fc.filtered[1],
range.fc.filtered[2],

#Significant Genes
FDR.cutoff,
tot.sig,
mean.exprs.sig,
mean.cv.exprs.sig,
range.fc.sig[1],
range.fc.sig[2],
true.FDR,
true.FNR)

if (run.num ==1){out.mat[,1] <- output.vec} else {out.mat <- 
cbind(out.mat, output.vec)}

print(paste("finished run ", run.num, sep = ""))

}

final.out.mat <- matrix(nrow = length(parameter), ncol = 3)

final.out.mat[,1] <- parameter

colnames(final.out.mat) <- c("parameter", "mean", "std. dev")

for (i in 1:nrow(final.out.mat))

     {

       err.out <- try(out.tmp <- sd(out.mat[i,]), TRUE)

       if(inherits(err.out, "try-error"))

        {final.out.mat[i,2:3]<-rep(NA,2)} else

          {final.out.mat[i,2:3]<-c(mean(out.mat[i,]), sd(out.mat[i,]))}
      }

write.table(final.out.mat, "simulation.output.csv", sep="\t", 
col.names=T, row.names=F, append=T)

}

##############################################################
##############################################################
##############################################################
###
###
###
###   END OF R Functions
###
###
###
##############################################################
##############################################################
##############################################################

Jenny Drnevich wrote:
> Hi Mark,
> 
> I also have been suspicious of filtering out low-variance genes. There 
> are 1-2 issues, depending on how you calculate the test statistic. The 
> first is the possible distortion of FDR calculations. On page 234 of the 
> BioC book you mention, Scholtens & Heydebreck state: "If the truly 
> differentially expressed genes are overrepresented among those selected 
> in the filtering step, the FDR associated with a certain threshold of 
> the test statistic will be lowered due to filtering."  I believe this is 
> what your colleague was complaining about, as I can't see how filtering 
> based on variance could do anything except overrepresent DEG. But - is 
> this a good or bad thing?
> 
> The second issue may not matter depending on how you calculate the test 
> statistic. If you use limma, which uses a Bayesian shrinkage based on 
> the variance of all genes, filtering out low-variance genes will lead to 
> a higher average gene variance, and hence less shrinkage. Scholtens & 
> Heydebreck also point this out on pg. 234: "Also concerning the 
> /variability across samples/, a higher overall variance of the 
> differentially expressed genes may be expected, because their 
> between-class variance adds to their within-class variance."  I've 
> looked at this in some of my data sets, and the t values calculated by 
> eBayes() after filtering are generally lower than they were before 
> filtering. Now, the boost to FDR correction for a fewer number of genes 
> tends to cancel out the lower p-values, but not always.
> 
> I also would be interested in any theoretical papers on this that 
> confirm or rebuke my intuitions. I myself have no problem doing a 
> conservative filtering to remove genes that likely aren't expressed in 
> any of the samples, but I don't filter based on variance because I use 
> the Bayesian correction in limma. There was a lengthy exchange on this 
> top in Feb 2006 (subject: Invalid fold-filter). Robert Gentleman 
> https://stat.ethz.ch/pipermail/bioconductor/2006-February/011988.html 
> said you can do simulations to convince yourself that filtering on 
> variance doesn't really bias the results, at least in regards to the 
> first issue I mention. The effects on the Bayesian shrinkage I don't 
> think have been addressed...
> 
> Cheers,
> Jenny
> 
> At 11:36 PM 4/2/2007, Mark W Kimpel wrote:
>> I have been using what I consider to be non-biased filtering of
>> low-variance genes using the method described in "Bioinformatics and
>> Computational Biology Solutions using R and Bioconductor", R. Gentleman,
>> et al., page 233 for some time and have recently run into some
>> resistance from a colleague who claims that this type of filtering
>> distorts FDR calculations because it introduces bias. His reasoning is
>> that, since this method tends to filter out genes with higher p values
>> and/or lower fold changes, that it is sort of a sneaky way of
>> accomplishing just that. Of course, filtering by phenotype does
>> introduce bias, but in this case I believe that by filtering based on
>> the a priori assumption that we just aren't that interested in low
>> variance genes for biologic reasons (even if statistically significant
>> they will have very low fold changes and thus be of questionable
>> meaning) that we aren't violating the statistical underpinnings of the
>> analysis.
>>
>> I need some help in justifying this filtering step. Does anyone know of
>> a peer-reviewed reference that gives a theoretical justification for its
>> use of of any empiric experiments that show that it is legit?
>>
>> Thanks,
>> Mark
>> -- 
>> Mark W. Kimpel MD
>> Neuroinformatics
>> Department of Psychiatry
>> Indiana University School of Medicine
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: 
>> http://news.gmane.org/gmane.science.biology.informatics.conductor 
> 
> Jenny Drnevich, Ph.D.
> 
> Functional Genomics Bioinformatics Specialist
> W.M. Keck Center for Comparative and Functional Genomics
> Roy J. Carver Biotechnology Center
> University of Illinois, Urbana-Champaign
> 
> 330 ERML
> 1201 W. Gregory Dr.
> Urbana, IL 61801
> USA
> 
> ph: 217-244-7355
> fax: 217-265-5066
> e-mail: drnevich at uiuc.edu
> 

-- 
Mark W. Kimpel MD
Neuroinformatics
Department of Psychiatry
Indiana University School of Medicine