[BioC] Filtering by variance, IQR, etc.
Mark W Kimpel
mwkimpel at gmail.com
Wed Apr 11 03:24:30 CEST 2007
A little over a week ago I posed the question of what effect, if any,
the method of filtering out genes based on low levels of variance (as
measured by IQR) has on the calculation of the FDR. I appreciated
Jenny's comments (below) and read over the posts by Robert Gentleman and
others of Feb, 2006 (see Jenny's note far below).
Not being a theoritician, I decided to pursue some simulations of my own
to better understand the process and have discovered, to my chagrin,
that filtering by IQR does seem to cause a disturbance to the p value
distribution and an under-reporting of the true FDR.
Immediately below is are the results I obtained with my simulations and
below that is a script, with functions, which can be used to duplicate
my results in R. I would ask that interested parties review my script
and results carefully and comment on my methods and my conclusions that
it is perhaps better not to do the IQR filtering, but to accept the fact
that if we do not do the filtering, we just have to live with a higher
FDR. I am actually hoping that someone will prove me wrong, but I do
believe that I am correct.
I look forward to more healthy debate on this topic.
Mark
This output is tab-delimited and will look fine if pasted into Excel:
parameter mean std. dev mean std. dev mean std. dev mean std. dev
number of simulation runs 20.00 0.00 20.00 0.00 20.00 0.00 20.00 0.00
initial number of genes 30000.00 0.00 30000.00 0.00 30000.00 0.00
30000.00 0.00
initial percentage of significant genes 0.00 0.00 0.00 0.00
3.00 0.00 3.00 0.00
true number of significant genes 0.00 0.00 0.00 0.00 900.00 0.00
900.00 0.00
initial seeded fold change min. 1.00 0.00 1.00 0.00 1.10 0.00 1.10 0.00
initial seeded fold change max 1.00 0.00 1.00 0.00 2.25 0.00 2.25 0.00
number of arrays 12.00 0.00 12.00 0.00 12.00 0.00 12.00 0.00
initial mean expression 1000.04 0.20 1000.04 0.28 1010.06 0.19
1010.15 0.27
initial c.v. 0.13 0.00 0.13 0.00 0.13 0.00 0.13 0.00
initial actual fold change min. 1.00 0.00 1.00 0.00 1.00 0.00 1.00 0.00
initial actual fold change max. 1.39 0.03 1.40 0.03 2.68 0.09 2.66 0.08
IQR percentile of filter 0.00 0.00 25.00 0.00 0.00 0.00 25.00 0.00
post-fitler number of genes 30000.00 0.00 22501.00 0.00
30000.00 0.00 22501.00 0.00
post-filter mean expression 1000.04 0.20 998.47 0.31 1010.06 0.19
1011.91 0.27
post-filter c.v. 0.13 0.00 0.13 0.00 0.13 0.00 0.14 0.00
post-filter fold change min. 1.00 0.00 1.00 0.00 1.00 0.00 1.00 0.00
post-filter fold change max. 1.39 0.03 1.40 0.03 2.68 0.09 2.66 0.08
Storey q value FDR 0.20 0.00 0.20 0.00 0.20 0.00 0.20 0.00
sig. genes number 0.30 0.47 0.40 0.75 923.05 16.95 980.85 23.10
sig. genes mean expression NA NA NA NA 1309.78 5.56 1292.99 6.02
sig. genes c.v. NA NA NA NA 0.28 0.00 0.27 0.00
sig. genes fold change min. NA NA NA NA 1.10 0.01 1.12 0.01
sig. genes fold change max. NA NA NA NA 2.68 0.09 2.66 0.08
sig. genes true FDR NA NA NA NA 0.19 0.01 0.23 0.02
sig. genes true FNR NA NA NA NA 0.17 0.01 0.14 0.01
#Script to simulate the effects of IQR filtering on the FDR and positive
gene characteristics
# of a simulated dataset
##################################################################################################
#################################################################################################
# describe perform t-testing with FDR correction, make histogram of p
values,
# determine true FDR, look at Fold Change (FC) characteristics of
significant genes under
# 4 different conditions: 1. no induced FC, no filtering 2. no induced
FC, IQR filtering,
# 3. induced FC, no IQR filtering, 4. induced FC, yes IQR filtering
#################################################################################################
#Input parameters
#num.sim.genes = 30000 #number of simulated genes on a array
#num.arrays = 12 #number of simulated arrays, must be even with n1=n2
for this 2 group simulation
#mean.exprs = 1000 #mean simulated gene expression, will be normally
distributed
#coeff.var= 0.13, #coefficient of variation within and between genes
#FC.range = c(1,1) #uniform range of fold change between groups
#percent.genes.real.change = 0 #percentage of genes that have a
simulated fold change
#IQR.filt.per = 0 #percentage of genes to apply the IQR filter to, if
any (zero means no genes filtered)
#FDR.cutoff = 0.20 #BH FDR cutoff level for significance
#num.runs = 20 #number of times the simulation is run to determine the
std. dev. for each output value
#################################################################################################
#Output: a .pdf file of the p value distribution for the last simulation
run and a tab-delim text file with .cvs file
#extension (works with OpenOffice, you might want to change) that
contains the following rows
#Initial State
#"number of simulation runs" #number of total simulations run for a
given set of input parameters
#"initial number of genes" #how many simulated genes on each array
#"initial percentage of significant genes" #percentage of genes with
simulated spike-in fold changes
#"true number of significant genes" #number of genes with simulated
spike-in fold changes
#"initial seeded fold change min." #the minimum spiked-in fold change
#"initial seeded fold change max" #the maximum spiked-in fold change
#"number of arrays" #number of simulated arrays
#"initial mean expression" #initial mean expression level
#"initial c.v." #initial mean within-gene coeffecient of variation
#"initial actual fold change min." #initial actual minimum fold change
#"initial actual fold change max." #initial actual maximum fold change
#Post-filtered state
#"IQR percentile of filter" #percentage of the IQR distribution below
which genes will be filtered out
#"post-fitler number of genes" #number of remaining genes after IQR
filtration
#"post-filter mean expression" #mean expression of genes remaining
after filtration
#"post-filter c.v." #mean c.v. of genes remaining after filtration
#"post-filter fold change min." #post-filtration actual minimum fold change
#"post-filter fold change max." #post-filtration actual maximum fold
change
#Significant Genes
#"BH value FDR" #set FDR threshold
#"sig. genes number" #total number of genes found significant
#"sig. genes mean expression" #mean expression of significant genes
#"sig. genes c.v." #mean with-gene c.v. of signficant genes
#"sig. genes fold change min." #minimum fold change of significant genes
#"sig. genes fold change max." #maximum fold change of significant genes
#"sig. genes true FDR" # the true FDR
#"sig. genes true FNR" #the true FNR
##############################################################
#LOAD THE FUNCTIONS BELOW BEFORE RUNNING THE FOLLOWING SIMULATIONS
IQR.sim.func(num.sim.genes = 30000, num.arrays = 12, mean.exprs = 1000,
coeff.var= 0.13,
FC.range = c(1,1), IQR.filt.per = 0, FDR.cutoff = 0.20,
percent.genes.real.change = 0, num.runs = 20)
IQR.sim.func(num.sim.genes = 30000, num.arrays = 12, mean.exprs = 1000,
coeff.var= 0.13,
FC.range = c(1,1), IQR.filt.per = 25, FDR.cutoff = 0.20,
percent.genes.real.change = 0, num.runs = 20)
IQR.sim.func(num.sim.genes = 30000, num.arrays = 12, mean.exprs = 1000,
coeff.var= 0.13,
FC.range = c(1.1, 2.25), IQR.filt.per = 0, FDR.cutoff = 0.20,
percent.genes.real.change = 3, num.runs = 20)
IQR.sim.func(num.sim.genes = 30000, num.arrays = 12, mean.exprs = 1000,
coeff.var= 0.13,
FC.range = c(1.1, 2.25), IQR.filt.per = 25, FDR.cutoff = 0.20,
percent.genes.real.change = 3, num.runs = 20)
##############################################################
##############################################################
##############################################################
###
### END OF SIMULATION
###
##############################################################
##############################################################
##############################################################
#################################################################################################
# SIMULATION FUNCTIONS
#################################################################################################
IQR.sim.func <- function(num.sim.genes, num.arrays, mean.exprs, coeff.var,
FC.range, IQR.filt.per, FDR.cutoff, percent.genes.real.change, num.runs)
{
#a couple of utility functions to use later on
#fold change calculator
abs.fc.calc.func <- function(m1,m2){
fc <- mean(m2)/mean(m1); if (fc < 1) {fc <- 1/fc} ; round(fc, digits
= 2); fc}
#coefficient of variation calculator
cv.func <- function(vec){cv <- sd(vec)/mean(vec); cv}
for (run.num in 1:num.runs)
{
##################################################################################################
#construct matrix of random normal numbers
cv <- coeff.var
mat <-abs(matrix(rnorm((num.sim.genes * num.arrays), mean = mean.exprs,
sd = (mean.exprs * cv)), nrow = num.sim.genes, ncol
= num.arrays))
# "Seed" the matrix with genes with uniform dist. of fold changes of
range min.change-max change
per.seed <- percent.genes.real.change #percentage of genes seeded
genes.seed <- (per.seed/100) * num.sim.genes #number of genes seeded
#marker for which genes are "true positives"(TRUE) and which are "true
negatives"(FALSE)
genes.seed.logical <- c(rep(TRUE, genes.seed), rep(FALSE,(num.sim.genes
- genes.seed)))
min.change <- FC.range[1]
max.change <- FC.range[2]
sim.FC <- seq(from = min.change, to = max.change, length.out = genes.seed)
if (genes.seed != 0)
{
for (i in 1:genes.seed) # Do seeding
{
mat[i,1:(num.arrays/2)] <- sim.FC[i] * mat[i, 1:(num.arrays/2)]}
}
# Characteristics of all genes
mean.exprs.all <- mean(mat)
mean.cv.exprs.all <-mean(apply(mat, 1, cv.func))
fc.vec.all <- rep(NA, nrow(mat))
for (i in 1:nrow(mat)){fc.vec.all[i] <-
abs.fc.calc.func(mat[i,1:(num.arrays/2)], mat[i,((num.arrays/2) +
1):num.arrays])}
range.fc.all <- quantile(fc.vec.all, p = seq(0,1))
#IQR filtering
IQR.cutoff <- IQR.filt.per #the percentile of IQR below which will be
filtered out
IQR.vec <- apply(log2(mat), 1, IQR) #compute IQR for each row of mat
("gene"), needs to be log2
IQR.per <- 100*rank(IQR.vec)/length(IQR.vec) #compute IQR percentile for
each "gene"
IQR.filt <- IQR.per >= IQR.cutoff #T/F filter
mat <- mat[IQR.filt,]
genes.seed.logical <- genes.seed.logical[IQR.filt]
# Characteristics of post-filtered genes
mean.exprs.filtered <- mean(mat)
mean.cv.exprs.filtered <- mean(apply(mat, 1, cv.func))
fc.vec.filtered <- rep(NA, nrow(mat))
for (i in 1:nrow(mat)){fc.vec.filtered[i] <-
abs.fc.calc.func(mat[i,1:(num.arrays/2)], mat[i,((num.arrays/2) +
1):num.arrays])}
range.fc.filtered <- quantile(fc.vec.filtered, p = seq(0,1))
# Do t-testing
p.vec <- rep(NA, nrow(mat))
for (i in 1:length(p.vec))
{
p.vec[i] <- t.test(x = mat[i,1:(num.arrays/2)], y =
mat[i,((num.arrays/2) + 1):num.arrays],
alternative = "two.sided", var.equal = TRUE)$p.value
}
hist(p.vec, breaks = 100, main=paste("Histogram of p values.
IQR.filt.per = ", IQR.filt.per,
". FC.range is ", FC.range[1], " to ", FC.range[2], sep="")
,xlab="p value")
require(geneplotter, quiet=T)
savepdf(fn = paste("hist.p.values.IQR.filt.per", IQR.filt.per,
"FC.range.", FC.range[1], FC.range[2], sep="_"), dir = getwd())
# FDR correction using BH method
p.adj.vec <- p.adjust(p.vec, method = "BH")
sig.filt <- p.adj.vec <= FDR.cutoff
tot.sig <- sum(sig.filt) #total of sig. genes
tot.true.neg.sig <- sum(sig.filt & !genes.seed.logical) #true negatives
in genes called sig.
true.FDR <- tot.true.neg.sig/tot.sig
tot.true.pos.not.sig <- sum(!sig.filt & genes.seed.logical) #true
positives in genes called not sig.
true.FNR <- tot.true.pos.not.sig/sum(genes.seed.logical)
#characteristics of total positive genes
if (sum(sig.filt) > 1)
{
mat <- mat[sig.filt,]
mean.exprs.sig <- mean(mat)
mean.cv.exprs.sig <- mean(apply(mat, 1, cv.func))
fc.vec.sig <- rep(NA, nrow(mat))
for (i in 1:nrow(mat)){fc.vec.sig[i] <-
abs.fc.calc.func(mat[i,1:(num.arrays/2)], mat[i,((num.arrays/2)
+ 1):num.arrays])}
range.fc.sig <- quantile(fc.vec.sig, p = seq(0,1))
} else {
mean.exprs.sig <- NA; mean.cv.exprs.sig <- NA; range.fc.sig
<-rep(NA,2)}
#############################################################
############################################################
# Create output datastructure
if (run.num == 1)
{
parameter <- c(
#Initial State
"number of simulation runs",
"initial number of genes",
"initial percentage of significant genes",
"true number of significant genes",
"initial seeded fold change min.",
"initial seeded fold change max",
"number of arrays",
"initial mean expression",
"initial c.v.",
"initial actual fold change min.",
"initial actual fold change max.",
#Post-filtered state
"IQR percentile of filter",
"post-fitler number of genes",
"post-filter mean expression",
"post-filter c.v.",
"post-filter fold change min.",
"post-filter fold change max.",
#Significant Genes
"BH FDR",
"sig. genes number",
"sig. genes mean expression",
"sig. genes c.v.",
"sig. genes fold change min.",
"sig. genes fold change max.",
"sig. genes true FDR",
"sig. genes true FNR")
out.mat <- matrix(ncol = 1, nrow = length(parameter))
}
#Create output of i and bind to out.mat
output.vec <-
c(num.runs,
num.sim.genes,
per.seed,
(num.sim.genes*per.seed/100),
FC.range[1],
FC.range[2],
num.arrays,
mean.exprs.all,
mean.cv.exprs.all,
range.fc.all[1],
range.fc.all[2],
#Post-filtered state
IQR.filt.per,
sum(IQR.filt),
mean.exprs.filtered,
mean.cv.exprs.filtered,
range.fc.filtered[1],
range.fc.filtered[2],
#Significant Genes
FDR.cutoff,
tot.sig,
mean.exprs.sig,
mean.cv.exprs.sig,
range.fc.sig[1],
range.fc.sig[2],
true.FDR,
true.FNR)
if (run.num ==1){out.mat[,1] <- output.vec} else {out.mat <-
cbind(out.mat, output.vec)}
print(paste("finished run ", run.num, sep = ""))
}
final.out.mat <- matrix(nrow = length(parameter), ncol = 3)
final.out.mat[,1] <- parameter
colnames(final.out.mat) <- c("parameter", "mean", "std. dev")
for (i in 1:nrow(final.out.mat))
{
err.out <- try(out.tmp <- sd(out.mat[i,]), TRUE)
if(inherits(err.out, "try-error"))
{final.out.mat[i,2:3]<-rep(NA,2)} else
{final.out.mat[i,2:3]<-c(mean(out.mat[i,]), sd(out.mat[i,]))}
}
write.table(final.out.mat, "simulation.output.csv", sep="\t",
col.names=T, row.names=F, append=T)
}
##############################################################
##############################################################
##############################################################
###
###
###
### END OF R Functions
###
###
###
##############################################################
##############################################################
##############################################################
Jenny Drnevich wrote:
> Hi Mark,
>
> I also have been suspicious of filtering out low-variance genes. There
> are 1-2 issues, depending on how you calculate the test statistic. The
> first is the possible distortion of FDR calculations. On page 234 of the
> BioC book you mention, Scholtens & Heydebreck state: "If the truly
> differentially expressed genes are overrepresented among those selected
> in the filtering step, the FDR associated with a certain threshold of
> the test statistic will be lowered due to filtering." I believe this is
> what your colleague was complaining about, as I can't see how filtering
> based on variance could do anything except overrepresent DEG. But - is
> this a good or bad thing?
>
> The second issue may not matter depending on how you calculate the test
> statistic. If you use limma, which uses a Bayesian shrinkage based on
> the variance of all genes, filtering out low-variance genes will lead to
> a higher average gene variance, and hence less shrinkage. Scholtens &
> Heydebreck also point this out on pg. 234: "Also concerning the
> /variability across samples/, a higher overall variance of the
> differentially expressed genes may be expected, because their
> between-class variance adds to their within-class variance." I've
> looked at this in some of my data sets, and the t values calculated by
> eBayes() after filtering are generally lower than they were before
> filtering. Now, the boost to FDR correction for a fewer number of genes
> tends to cancel out the lower p-values, but not always.
>
> I also would be interested in any theoretical papers on this that
> confirm or rebuke my intuitions. I myself have no problem doing a
> conservative filtering to remove genes that likely aren't expressed in
> any of the samples, but I don't filter based on variance because I use
> the Bayesian correction in limma. There was a lengthy exchange on this
> top in Feb 2006 (subject: Invalid fold-filter). Robert Gentleman
> https://stat.ethz.ch/pipermail/bioconductor/2006-February/011988.html
> said you can do simulations to convince yourself that filtering on
> variance doesn't really bias the results, at least in regards to the
> first issue I mention. The effects on the Bayesian shrinkage I don't
> think have been addressed...
>
> Cheers,
> Jenny
>
> At 11:36 PM 4/2/2007, Mark W Kimpel wrote:
>> I have been using what I consider to be non-biased filtering of
>> low-variance genes using the method described in "Bioinformatics and
>> Computational Biology Solutions using R and Bioconductor", R. Gentleman,
>> et al., page 233 for some time and have recently run into some
>> resistance from a colleague who claims that this type of filtering
>> distorts FDR calculations because it introduces bias. His reasoning is
>> that, since this method tends to filter out genes with higher p values
>> and/or lower fold changes, that it is sort of a sneaky way of
>> accomplishing just that. Of course, filtering by phenotype does
>> introduce bias, but in this case I believe that by filtering based on
>> the a priori assumption that we just aren't that interested in low
>> variance genes for biologic reasons (even if statistically significant
>> they will have very low fold changes and thus be of questionable
>> meaning) that we aren't violating the statistical underpinnings of the
>> analysis.
>>
>> I need some help in justifying this filtering step. Does anyone know of
>> a peer-reviewed reference that gives a theoretical justification for its
>> use of of any empiric experiments that show that it is legit?
>>
>> Thanks,
>> Mark
>> --
>> Mark W. Kimpel MD
>> Neuroinformatics
>> Department of Psychiatry
>> Indiana University School of Medicine
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
> Jenny Drnevich, Ph.D.
>
> Functional Genomics Bioinformatics Specialist
> W.M. Keck Center for Comparative and Functional Genomics
> Roy J. Carver Biotechnology Center
> University of Illinois, Urbana-Champaign
>
> 330 ERML
> 1201 W. Gregory Dr.
> Urbana, IL 61801
> USA
>
> ph: 217-244-7355
> fax: 217-265-5066
> e-mail: drnevich at uiuc.edu
>
--
Mark W. Kimpel MD
Neuroinformatics
Department of Psychiatry
Indiana University School of Medicine
More information about the Bioconductor
mailing list