[BioC] Summarisation in oligo package after exclusion of probes containing SNPs

Tue Nov 5 18:38:53 CET 2013

Hi Dan,

Thanks very much for your reply.

Yes I'd be very interested in trying to implement your approach

Best wishes

Jimmy

On 2013-11-05 16:59, Daniel Bottomly wrote:
> Hi Jimmy:
> 
> 1.) We do variations on this procedure quite frequently in the field of
> mouse genetics and do find it to be useful in practice. The most common
> way that I have seen is to remove the probes ahead of time prior to
> background correction.  That being said, it would be interesting to
> compare the effect of probe removal ahead of time to probe removal 
> after
> background correction and normalization.  I suppose there will be cases
> where the removal of a probe from a probeset would remove an assignment 
> to
> an Entrez gene ID such as the case where the probeset is removed in its
> entirety.  You could always use (re)alignments of the probes to 
> directly
> determine which genes are interrogated.
> 
> 2.) I will defer to Benilton or others in terms of the best way to do
> this.  If you are interested and willing to get your hands a little 
> dirty
> I could point you to our approach based on the oligo package which is
> still under development but does contain documentation, unit tests, 
> etc.
> 
> Thanks,
> 
> Dan
> 
> On 11/3/13 7:00 AM, "Jimmy Peters [guest]" <guest at bioconductor.org> 
> wrote:
> 
>> 
>> Dear Benilton/BioC list,
>> 
>> Query: Is it possible to perform summarisation using the oligo package
>> after exclusion of probes containing SNPs?
>> 
>> Background: I've performed an eQTL analysis where the expression data 
>> has
>> been obtained on Affymetrix HuGene ST 1.1 microarrays.
>> After reading in the CEL files, I pre-processed the GeneFeatureSet 
>> doing
>> background correction, quantile normalisation, and summarisation to
>> gene-level using the rma function from the oligo package.
>> 
>> # What my real data looks like:
>> ( cels <- read.celfiles(pathToFiles) ) # generates a GeneFeatureSet
>> 
>> GeneFeatureSet (storageMode: lockedEnvironment)
>> assayData: 1178100 features, 90 samples
>> element names: exprs
>> protocolData
>> rowNames: 75_459_CD16.CEL 76_460_CD16.CEL ... 
>> triad0078_H7_498_CD16.CEL
>> (90 total)
>> varLabels: exprs dates
>> varMetadata: labelDescription channel
>> phenoData
>> rowNames: 75_459_CD16.CEL 76_460_CD16.CEL ... 
>> triad0078_H7_498_CD16.CEL
>> (90 total)
>> varLabels: cell_type info.batch.name ... age (29 total)
>> varMetadata: labelDescription channel
>> featureData: none
>> experimentData: use 'experimentData(object)'
>> Annotation: pd.hugene.1.1.st.v1
>> 
>> #Example workfow using oligoData for reproducibilty:
>> 
>> library(oligo)
>> library(oligoData)
>> 
>> data(affyGeneFS)
>> affyGeneFS        # a GeneFeatureSet
>> 
>> GeneFeatureSet (storageMode: lockedEnvironment)
>> assayData: 1102500 features, 33 samples
>> element names: exprs
>> protocolData
>> rowNames: TisMap_Brain_01_v1_WTGene1.CEL 
>> TisMap_Brain_02_v1_WTGene1.CEL
>> ... TisMap_Thyroid_03_v1_WTGene1.CEL (33 total)
>> varLabels: exprs dates
>> varMetadata: labelDescription channel
>> phenoData
>> rowNames: TisMap_Brain_01_v1_WTGene1.CEL 
>> TisMap_Brain_02_v1_WTGene1.CEL
>> ... TisMap_Thyroid_03_v1_WTGene1.CEL (33 total)
>> varLabels: index
>> varMetadata: labelDescription channel
>> featureData: none
>> experimentData: use 'experimentData(object)'
>> Annotation: pd.hugene.1.0.st.v1
>> 
>> # generate gene-level expression values:
>> eset <- rma(affyGeneFS)
>> 
>> # equivalent to:
>> bgCorrected <- backgroundCorrect(affyGeneFS)                          
>> #
>> Background correct
>> normalized <- normalize(bgCorrected,  method="quantile")              
>> #
>> Quantile normalise
>> eset2 <- rma(normalized, background=F, normalize=F, subset=NULL)      
>> #
>> Summarize with median polish
>> 
>> I would like to re-run the eQTL analysis, this time excluding probes 
>> from
>> the expression data that contain
>> SNPs in >1% of CEU individuals to see whether this has any substantial
>> effect on my findings. My concern is
>> the possibility of apparent eQTLs which are in fact artefacts due to 
>> less
>> efficient binding between probes
>> and transcripts containing the minor allele.
>> 
>> My questions are-
>> 
>> 1) Having obtained a list of the probes to exclude, is it valid to 
>> simply
>> perform BG correction and quantile normalisation on the whole
>> GeneFeatureSet,
>> but to then remove probes containing SNPs prior to summarisation? 
>> Would
>> the validity of mapping the resulting probesets to Entrez ids etc.
>> be questionable if some probes had been excluded?
>> 
>> 2) If such an approach is reasonable, how can I map Affymetrix
>> probe-level identifiers to the GeneFeatureSet?
>> 
>> # I know how to get feature ids for PM probes and their groupings into
>> gene-level probesets:
>> featureInfo <- oligo:::stArrayPmInfo(normalized, target = 'core')
>> 
>> head(featureInfo,10)
>> 
>>       fid  fsetid
>> 1   116371 7892501
>> 2   943979 7892501
>> 3   493089 7892501
>> 4   907039 7892501
>> 5  1033309 7892502
>> 6   653512 7892502
>> 7   690769 7892502
>> 8   997409 7892502
>> 9   379979 7892503
>> 10  469846 7892503
>> 
>> The fsetid s correspond to gene-level probesets. Are the fid s row
>> indices in the GeneFeatureSet,
>> and how can they be linked to Affy probe level ids?
>> 
>> Assuming I can map Affy probe ids to the fid column, here is my 
>> proposed
>> solution for
>> summarisation after removal of selected probes, but I wonder if there 
>> is
>> a more straightforward way?
>> 
>> # for demo purposes generate a random list of fids to remove
>> # some fid s are duplicated in the data.frame as they map to multiple
>> probesets, so use 'unique'
>> set.seed(1234)
>> fidToCut <- sample(unique(featureInfo$fid), 10000, replace=F)
>> 
>> # remove rows from GeneFeatureSet corressponding to fid s we want to
>> exclude
>> normalized2 <- normalized[-fidToCut, ]
>> 
>> # create mapping for the new GeneFeatureSet of how old row indices map 
>> to
>> new indices
>> map <- cbind(new_ind= 1:nrow(normalized2), old_ind=
>> featureNames(normalized2) )
>> 
>> # pmi gives row indices of pm probes in 'normalized', not 
>> 'normalized2'
>> pmi <-  featureInfo[["fid"]]
>> 
>> # re-order dataframe so 'old_ind' column is the same as pmi
>> map1 <- map[match(pmi, map[,"old_ind"]),]
>> map1 <- cbind(map1, featureInfo)
>> 
>> tail(map1)
>>       new_ind old_ind    fid  fsetid
>> 861488  961136  969962 969962 8180418
>> 861489  208831  210719 210719 8180418
>> 861490  870109  878113 878113 8180418
>> 861491  598190  603636 603636 8180418
>> 861492  858752  866642 866642 8180418
>> 861493  713834  720353 720353 8180418
>> 
>> # we need to get rid of rows with NA... e.g. where 'fid' but no 
>> 'new_ind'
>> exists:
>> # effectively recreating a new 'featureInfo' dataframe
>> map1 <- na.omit(map1)
>> 
>> # now we can get the new row indices, and the probesets they belong to
>> pmi <-  as.numeric( as.character( map1[["new_ind"]] )) # 'new_ind' got
>> coerced to a factor
>> pnVec <- as.character(map1[["fsetid"]])
>> 
>> # subset the normalized probe-level values to keep probes by indexes 
>> in
>> pmi
>> pms <- exprs(normalized2)[pmi, , drop = FALSE]
>> dimnames(pms) <- NULL
>> colnames(pms) <- sampleNames(normalized2)
>> # get a matrix of summarized values, which can be used to create an
>> ExpressionSet
>> theExprs <- basicRMA(pms, pnVec, normalize=F, background=F)
>> 
>> I'd greatly appreciate any advice. I'm a biologist by background so
>> apologies if this is rather basic.
>> Thanks very much
>> Jimmy Peters, Smith Lab, Cambridge Institute for Medical Research.
>> 
>> -- output of sessionInfo():
>> 
>> R version 3.0.2 (2013-09-25)
>> Platform: x86_64-apple-darwin10.8.0 (64-bit)
>> 
>> locale:
>> [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
>> 
>> attached base packages:
>> [1] parallel  stats     graphics  grDevices utils     datasets  
>> methods
>> base
>> 
>> other attached packages:
>> [1] pd.hugene.1.0.st.v1_3.8.0 oligoData_1.8.0           biomaRt_2.18.0
>>         pd.hugene.1.1.st.v1_3.8.0 RSQLite_0.11.4            DBI_0.2-7
>> 
>> [7] oligo_1.26.0              Biobase_2.22.0
>> oligoClasses_1.24.0       BiocGenerics_0.8.0
>> 
>> loaded via a namespace (and not attached):
>> [1] affxparser_1.34.0     affyio_1.30.0         BiocInstaller_1.12.0
>> Biostrings_2.30.0     bit_1.1-10            codetools_0.2-8
>> ff_2.2-12
>> [8] foreach_1.4.1         GenomicRanges_1.14.1  IRanges_1.20.0
>> iterators_1.0.6       preprocessCore_1.24.0 RCurl_1.95-4.1
>> splines_3.0.2
>> [15] stats4_3.0.2          tools_3.0.2           XML_3.95-0.2
>> XVector_0.2.0         zlibbioc_1.8.0
>> 
>> --
>> Sent via the guest posting facility at bioconductor.org.
>> 
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor