[BioC] Summarisation in oligo package after exclusion of probes containing SNPs
Dr J. Peters
jp549 at cam.ac.uk
Tue Nov 5 18:38:53 CET 2013
Hi Dan,
Thanks very much for your reply.
Yes I'd be very interested in trying to implement your approach
Best wishes
Jimmy
On 2013-11-05 16:59, Daniel Bottomly wrote:
> Hi Jimmy:
>
> 1.) We do variations on this procedure quite frequently in the field of
> mouse genetics and do find it to be useful in practice. The most common
> way that I have seen is to remove the probes ahead of time prior to
> background correction. That being said, it would be interesting to
> compare the effect of probe removal ahead of time to probe removal
> after
> background correction and normalization. I suppose there will be cases
> where the removal of a probe from a probeset would remove an assignment
> to
> an Entrez gene ID such as the case where the probeset is removed in its
> entirety. You could always use (re)alignments of the probes to
> directly
> determine which genes are interrogated.
>
> 2.) I will defer to Benilton or others in terms of the best way to do
> this. If you are interested and willing to get your hands a little
> dirty
> I could point you to our approach based on the oligo package which is
> still under development but does contain documentation, unit tests,
> etc.
>
> Thanks,
>
> Dan
>
> On 11/3/13 7:00 AM, "Jimmy Peters [guest]" <guest at bioconductor.org>
> wrote:
>
>>
>> Dear Benilton/BioC list,
>>
>> Query: Is it possible to perform summarisation using the oligo package
>> after exclusion of probes containing SNPs?
>>
>> Background: I've performed an eQTL analysis where the expression data
>> has
>> been obtained on Affymetrix HuGene ST 1.1 microarrays.
>> After reading in the CEL files, I pre-processed the GeneFeatureSet
>> doing
>> background correction, quantile normalisation, and summarisation to
>> gene-level using the rma function from the oligo package.
>>
>> # What my real data looks like:
>> ( cels <- read.celfiles(pathToFiles) ) # generates a GeneFeatureSet
>>
>> GeneFeatureSet (storageMode: lockedEnvironment)
>> assayData: 1178100 features, 90 samples
>> element names: exprs
>> protocolData
>> rowNames: 75_459_CD16.CEL 76_460_CD16.CEL ...
>> triad0078_H7_498_CD16.CEL
>> (90 total)
>> varLabels: exprs dates
>> varMetadata: labelDescription channel
>> phenoData
>> rowNames: 75_459_CD16.CEL 76_460_CD16.CEL ...
>> triad0078_H7_498_CD16.CEL
>> (90 total)
>> varLabels: cell_type info.batch.name ... age (29 total)
>> varMetadata: labelDescription channel
>> featureData: none
>> experimentData: use 'experimentData(object)'
>> Annotation: pd.hugene.1.1.st.v1
>>
>> #Example workfow using oligoData for reproducibilty:
>>
>> library(oligo)
>> library(oligoData)
>>
>> data(affyGeneFS)
>> affyGeneFS # a GeneFeatureSet
>>
>> GeneFeatureSet (storageMode: lockedEnvironment)
>> assayData: 1102500 features, 33 samples
>> element names: exprs
>> protocolData
>> rowNames: TisMap_Brain_01_v1_WTGene1.CEL
>> TisMap_Brain_02_v1_WTGene1.CEL
>> ... TisMap_Thyroid_03_v1_WTGene1.CEL (33 total)
>> varLabels: exprs dates
>> varMetadata: labelDescription channel
>> phenoData
>> rowNames: TisMap_Brain_01_v1_WTGene1.CEL
>> TisMap_Brain_02_v1_WTGene1.CEL
>> ... TisMap_Thyroid_03_v1_WTGene1.CEL (33 total)
>> varLabels: index
>> varMetadata: labelDescription channel
>> featureData: none
>> experimentData: use 'experimentData(object)'
>> Annotation: pd.hugene.1.0.st.v1
>>
>> # generate gene-level expression values:
>> eset <- rma(affyGeneFS)
>>
>> # equivalent to:
>> bgCorrected <- backgroundCorrect(affyGeneFS)
>> #
>> Background correct
>> normalized <- normalize(bgCorrected, method="quantile")
>> #
>> Quantile normalise
>> eset2 <- rma(normalized, background=F, normalize=F, subset=NULL)
>> #
>> Summarize with median polish
>>
>> I would like to re-run the eQTL analysis, this time excluding probes
>> from
>> the expression data that contain
>> SNPs in >1% of CEU individuals to see whether this has any substantial
>> effect on my findings. My concern is
>> the possibility of apparent eQTLs which are in fact artefacts due to
>> less
>> efficient binding between probes
>> and transcripts containing the minor allele.
>>
>> My questions are-
>>
>> 1) Having obtained a list of the probes to exclude, is it valid to
>> simply
>> perform BG correction and quantile normalisation on the whole
>> GeneFeatureSet,
>> but to then remove probes containing SNPs prior to summarisation?
>> Would
>> the validity of mapping the resulting probesets to Entrez ids etc.
>> be questionable if some probes had been excluded?
>>
>> 2) If such an approach is reasonable, how can I map Affymetrix
>> probe-level identifiers to the GeneFeatureSet?
>>
>> # I know how to get feature ids for PM probes and their groupings into
>> gene-level probesets:
>> featureInfo <- oligo:::stArrayPmInfo(normalized, target = 'core')
>>
>> head(featureInfo,10)
>>
>> fid fsetid
>> 1 116371 7892501
>> 2 943979 7892501
>> 3 493089 7892501
>> 4 907039 7892501
>> 5 1033309 7892502
>> 6 653512 7892502
>> 7 690769 7892502
>> 8 997409 7892502
>> 9 379979 7892503
>> 10 469846 7892503
>>
>> The fsetid s correspond to gene-level probesets. Are the fid s row
>> indices in the GeneFeatureSet,
>> and how can they be linked to Affy probe level ids?
>>
>> Assuming I can map Affy probe ids to the fid column, here is my
>> proposed
>> solution for
>> summarisation after removal of selected probes, but I wonder if there
>> is
>> a more straightforward way?
>>
>> # for demo purposes generate a random list of fids to remove
>> # some fid s are duplicated in the data.frame as they map to multiple
>> probesets, so use 'unique'
>> set.seed(1234)
>> fidToCut <- sample(unique(featureInfo$fid), 10000, replace=F)
>>
>> # remove rows from GeneFeatureSet corressponding to fid s we want to
>> exclude
>> normalized2 <- normalized[-fidToCut, ]
>>
>> # create mapping for the new GeneFeatureSet of how old row indices map
>> to
>> new indices
>> map <- cbind(new_ind= 1:nrow(normalized2), old_ind=
>> featureNames(normalized2) )
>>
>> # pmi gives row indices of pm probes in 'normalized', not
>> 'normalized2'
>> pmi <- featureInfo[["fid"]]
>>
>> # re-order dataframe so 'old_ind' column is the same as pmi
>> map1 <- map[match(pmi, map[,"old_ind"]),]
>> map1 <- cbind(map1, featureInfo)
>>
>> tail(map1)
>> new_ind old_ind fid fsetid
>> 861488 961136 969962 969962 8180418
>> 861489 208831 210719 210719 8180418
>> 861490 870109 878113 878113 8180418
>> 861491 598190 603636 603636 8180418
>> 861492 858752 866642 866642 8180418
>> 861493 713834 720353 720353 8180418
>>
>> # we need to get rid of rows with NA... e.g. where 'fid' but no
>> 'new_ind'
>> exists:
>> # effectively recreating a new 'featureInfo' dataframe
>> map1 <- na.omit(map1)
>>
>> # now we can get the new row indices, and the probesets they belong to
>> pmi <- as.numeric( as.character( map1[["new_ind"]] )) # 'new_ind' got
>> coerced to a factor
>> pnVec <- as.character(map1[["fsetid"]])
>>
>> # subset the normalized probe-level values to keep probes by indexes
>> in
>> pmi
>> pms <- exprs(normalized2)[pmi, , drop = FALSE]
>> dimnames(pms) <- NULL
>> colnames(pms) <- sampleNames(normalized2)
>> # get a matrix of summarized values, which can be used to create an
>> ExpressionSet
>> theExprs <- basicRMA(pms, pnVec, normalize=F, background=F)
>>
>> I'd greatly appreciate any advice. I'm a biologist by background so
>> apologies if this is rather basic.
>> Thanks very much
>> Jimmy Peters, Smith Lab, Cambridge Institute for Medical Research.
>>
>> -- output of sessionInfo():
>>
>> R version 3.0.2 (2013-09-25)
>> Platform: x86_64-apple-darwin10.8.0 (64-bit)
>>
>> locale:
>> [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
>>
>> attached base packages:
>> [1] parallel stats graphics grDevices utils datasets
>> methods
>> base
>>
>> other attached packages:
>> [1] pd.hugene.1.0.st.v1_3.8.0 oligoData_1.8.0 biomaRt_2.18.0
>> pd.hugene.1.1.st.v1_3.8.0 RSQLite_0.11.4 DBI_0.2-7
>>
>> [7] oligo_1.26.0 Biobase_2.22.0
>> oligoClasses_1.24.0 BiocGenerics_0.8.0
>>
>> loaded via a namespace (and not attached):
>> [1] affxparser_1.34.0 affyio_1.30.0 BiocInstaller_1.12.0
>> Biostrings_2.30.0 bit_1.1-10 codetools_0.2-8
>> ff_2.2-12
>> [8] foreach_1.4.1 GenomicRanges_1.14.1 IRanges_1.20.0
>> iterators_1.0.6 preprocessCore_1.24.0 RCurl_1.95-4.1
>> splines_3.0.2
>> [15] stats4_3.0.2 tools_3.0.2 XML_3.95-0.2
>> XVector_0.2.0 zlibbioc_1.8.0
>>
>> --
>> Sent via the guest posting facility at bioconductor.org.
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
More information about the Bioconductor
mailing list