[BioC] Some Genefilter questions

Thu Nov 30 20:47:20 CET 2006

Hi,

It may be worth pointing out that a related question can have a huge 
impact on normalization of certain glass arrays. One of the standard 
protocols on the Agilent 44K human arrays causes several hundred control 
spots to light up extremely brightly in the green channel, but remain 
completely off in the red channel.  If you leave these control spots in 
the data set when you normalize between channels (i.e., within arrays), 
every known normalization method breaks -- in the precise sense that it 
will systematically distort the comparison between the red and green 
channels.  If you then model the data incorporating a dye effect, you 
will think that almost every gene exhibits a dye bias.  On the other 
hand, if you remove these control spots before normalizing between 
channels, then modeling the dye bias suggests that it rarely exists.
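
A minimal sketch of that effect on simulated data, using base R only. The 
spot counts, intensities, and loess span are assumptions chosen for 
illustration, not values from a real Agilent array, and the plain loess 
fit stands in for whatever within-array method you actually use:

```r
set.seed(1)

# Simulated two-colour spots: A = average log2 intensity,
# M = log2(red / green). All numbers are made up for illustration.
n_total   <- 2000
n_control <- 200
spots <- data.frame(
  A = runif(n_total, 6, 14),
  M = rnorm(n_total, 0, 0.3),
  is_control = rep(c(TRUE, FALSE), times = c(n_control, n_total - n_control))
)

# Control spots: very bright in green, off in red, so M is strongly negative.
spots$A[spots$is_control] <- runif(n_control, 9, 11)
spots$M[spots$is_control] <- rnorm(n_control, -8, 0.5)

# Intensity-dependent curve fitted to every spot, controls included.
fit_all <- loess(M ~ A, data = spots, span = 0.3)

# Same curve fitted to the gene spots only.
fit_genes <- loess(M ~ A, data = spots[!spots$is_control, ], span = 0.3)

# Normalized M values = observed M minus the fitted curve.
M_all_spots  <- spots$M - predict(fit_all,   newdata = spots)
M_genes_only <- spots$M - predict(fit_genes, newdata = spots)

# Compare the gene spots in the intensity range where the controls sit.
mid_range <- !spots$is_control & spots$A > 9.5 & spots$A < 10.5
median(M_all_spots[mid_range])   # pulled away from zero by the controls
median(M_genes_only[mid_range])  # close to zero
```

In this toy setting the gene spots that share an intensity range with the 
controls pick up an apparent red bias when the controls are left in the fit.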

As for the question originally asked here, I would not expect the 
foreign species probes to break the normalization (unless they somehow 
light up in one group of samples but not in the other). So, my own bias 
would be to keep them for background correction and normalization, but 
remove them before the rest of the analysis.
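
For the removal step, here is a minimal base-R sketch on a toy matrix. The 
probeset IDs and values are invented for illustration; the "Pf." and "AFFX" 
prefixes are the ones mentioned in the thread below. With an ExpressionSet 
one would build the same logical index from featureNames() instead of 
rownames():

```r
# Toy stand-in for an already normalized expression matrix
# (probesets in rows, arrays in columns). IDs are hypothetical.
expr <- matrix(
  c(8.1, 8.3, 7.9, 8.0,
    2.1, 2.0, 2.2, 2.1,
    9.5, 9.4, 9.6, 9.7,
    6.2, 6.8, 6.1, 6.9),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    c("Ag.2L.100.0_at", "Pf.1.1.0_at", "AFFX-BioB-5_at", "Ag.3R.200.0_at"),
    paste0("chip", 1:4)
  )
)

# TRUE for parasite (Pf.) probesets and Affymetrix control (AFFX) probesets.
is_unwanted <- grepl("^(Pf\\.|AFFX)", rownames(expr))

# Keep the remaining probesets; drop = FALSE keeps the result a matrix
# even if only one row survives.
expr_sub <- expr[!is_unwanted, , drop = FALSE]
rownames(expr_sub)
```

Using a logical index from grepl() rather than integer positions from grep() 
means the subset is still correct when no ID matches the pattern.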

Best,
	Kevin

Jenny Drnevich wrote:
> Hi Amy,
> 
> Don't you just love it when you get one response suggesting you do one 
> thing (remove malarial genes after pre-processing) and another response 
> suggesting the opposite?  Although I think in this case Robert was 
> suggesting you remove them after pre-processing because it was easier than 
> trying to modify either the normalization code or the cdf environment, 
> which is what Jim pointed out to you. I ran into this same problem with 
> having probesets for other species on the soybean array, which is why I 
> used Ariel's code. I think that if you're using a mixed species array but 
> only put one of the species on it, then you should remove the other 
> species' probesets BEFORE doing the normalization because they really have 
> no bearing on the transcriptome you're trying to measure. On the other 
> hand, if you also want to filter your species' probesets based on 
> presence/absence, minimum cutoff, variation, etc.* , then you should filter 
> these genes AFTER doing the pre-processing because these probesets do 
> contain information about the transcriptome, even if it is just 'not 
> detectably expressed'.
> 
> Cheers,
> Jenny
> 
> * Contrary to Robert, I prefer to filter on presence/absence (using Affy's 
> calls) rather than variability :) I don't know if there is any 
> documentation on which may be "better"...
> 
> At 05:15 PM 11/29/2006, Robert Gentleman wrote:
>> Hi,
>>
>> Amy Mikhail wrote:
>>> Dear Bioconductors,
>>>
>>> I am analysing 6 PlasmodiumAnopheles genechips, which have only Anopheles
>>> mosquito samples hybridised to them (i.e. they are not infected
>>> mosquitoes).  The 6 chips include 3 replicates, each consisting of two
>>> time points.  The design matrix is as follows:
>>>
>>>> design
>>>      M15d M43d
>>> [1,]    1    0
>>> [2,]    0    1
>>> [3,]    1    0
>>> [4,]    0    1
>>> [5,]    1    0
>>> [6,]    0    1
>>>
>>>
>>> I have tried both gcRMA (in AffyLMGUI), and RMA, MBEI and MAS5 (in affy).
>>> Looking at the (BH) adjusted p values <0.05, this gave me 2, 12, 0 and 0
>>> DE genes, respectively... far fewer than I was expecting.
>>>
>>> As this affy chip contains probesets for both mosquito and malaria
>>> parasite genes, I am wondering:
>>>
>>> (a) if it is better to remove all the parasite probesets before my
>>> analysis;
>>
>>   Yes, if you don't intend to use them, and they are not relevant to
>> your analysis. There is no point in doing p-value corrections for tests
>> you know are not interesting/relevant a priori.
>>
>>> (b) if so at what stage I should do this (before or after normalisation
>>> and background correction, or does it matter?)
>>   After both and prior to analysis - otherwise you are likely to need to
>> do some serious tweaking of the normalization code.
>>
>>> (c) how would I filter out these probesets using genefilter (all the
>>> parasite affy IDs begin with Pf. - could I use this prefix in the affy IDs
>>> to filter out the probesets, and if so how?)
>>    You don't need genefilter at all; this is a subsetting problem.
>>   If you had an ExpressionSet you would do something like:
>>
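>>    # grepl() returns a logical vector, so ! inverts the selection
>>    # (grep() returns integer positions, and ! on those selects nothing)
>>    parasites = grepl("^Pf", featureNames(myExpressionSet))
>>
>>    mySubset = myExpressionSet[!parasites,]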
>>
>>> Secondly, I did not add any of the polyA controls to my samples.  I would
>>> like to know:
>>>
>>> (d) Do any of the bg correct / normalisation methods I tried utilise
>>> affymetrix control probesets, and if so, how?
>>    I doubt it.
>>
>>> (e) Should I also filter out the control sets - again, if so at what stage
>>> in the analysis and what would be an appropriate code to use?
>>>
>>    same place as you filter the parasite genes and pretty much in the
>> same way. They are likely to start with AFFX.
>>
>>> I did try the code for non-specific filtering (on my RMA dataset) from pg.
>>> 232 of the bioconductor monograph, but the reduction in the number of
>>> probesets was quite drastic;
>>>
>>>> f1 <- pOverA(0.25, log2(100))
>>>> f2 <- function(x) (IQR(x) > 0.5)
>>   that is a typo in the text - you probably want to filter out those
>> with IQR below the median, not for some fixed value.
>>
>>>> ff <- filterfun(f1, f2)
>>>> selected <- genefilter(Baseage.transformed, ff)
>>>> sum(selected)
>>> [1] 404   ###(The original no. of probesets is 22,726)###
>>>> Baseage.sub <- Baseage.transformed[selected, ]
>>> Also, I understood from the monograph that "100" was to filter out
>>> fluorescence intensities less than this, but I am not clear if this is
>>> from raw intensities or log2 values?
>>   raw - 100 on the log2 scale is larger than can be represented in the
>> image file formats used. And don't do that - it is not a good idea -
>> filter on variability.
>>
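
A minimal sketch of the variability filter suggested above, keeping the 
probesets whose IQR exceeds the median IQR across all probesets. It uses 
base R on simulated log2 values; the matrix dimensions and the simulated 
data are assumptions for illustration only:

```r
set.seed(42)

# Simulated log2 expression values: 1000 hypothetical probesets x 6 arrays.
expr <- matrix(rnorm(6000, mean = 8, sd = 1), nrow = 1000)

# Per-probeset interquartile range across the arrays.
iqr <- apply(expr, 1, IQR)

# Keep probesets whose IQR is above the median IQR, rather than
# above a fixed cutoff such as 0.5.
selected <- iqr > median(iqr)
expr_filtered <- expr[selected, , drop = FALSE]

sum(selected)
```

Because the cutoff is relative, this keeps about half of the probesets by 
construction, whatever the overall spread of the data.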
>>
>>> All the parasite probesets have raw intensities <35 .... so could I apply
>>> this as a simple filter, and would this have to be on raw (rather than
>>> normalised data)?
>>
>>   Best wishes
>>     Robert
>>
>>> Apologies for the long posting...
>>>
>>> Looking forward to any replies,
>>> Regards,
>>> Amy
>>>
>>>> sessionInfo()
>>> R version 2.4.0 (2006-10-03)
>>> i386-pc-mingw32
>>>
>>> locale:
>>> LC_COLLATE=English_United States.1252;LC_CTYPE=English_United
>>> States.1252;LC_MONETARY=English_United
>>> States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252
>>>
>>> attached base packages:
>>>  [1] "tcltk"     "splines"   "tools"     "methods"   "stats"
>>> "graphics"  "grDevices" "utils"     "datasets"  "base"
>>>
>>> other attached packages:
>>> plasmodiumanophelescdf "1.14.0"
>>> tkWidgets              "1.12.0"
>>> DynDoc                 "1.12.0"
>>> widgetTools            "1.10.0"
>>> agahomology            "1.14.2"
>>> affyPLM                "1.10.0"
>>> gcrma                  "2.6.0"
>>> matchprobes            "1.6.0"
>>> affydata               "1.10.0"
>>> annaffy                "1.6.0"
>>> KEGG                   "1.14.0"
>>> GO                     "1.14.0"
>>> limma                  "2.9.1"
>>> geneplotter            "1.12.0"
>>> annotate               "1.12.0"
>>> affy                   "1.12.0"
>>> affyio                 "1.2.0"
>>> genefilter             "1.12.0"
>>> survival               "2.29"
>>> Biobase                "1.12.0"
>>>
>>>
>>> -------------------------------------------
>>> Amy Mikhail
>>> Research student
>>> University of Aberdeen
>>> Zoology Building
>>> Tillydrone Avenue
>>> Aberdeen AB24 2TZ
>>> Scotland
>>> Email: a.mikhail at abdn.ac.uk
>>> Phone: 00-44-1224-272880 (lab)
>>>        00-44-1224-273256 (office)
>>>