[BioC] Summarisation in oligo package after exclusion of probes containing SNPs
Jimmy Peters [guest]
guest at bioconductor.org
Sun Nov 3 17:00:12 CET 2013
Dear Benilton/BioC list,
Query: Is it possible to perform summarisation using the oligo package after exclusion of probes containing SNPs?
Background: I've performed an eQTL analysis where the expression data has been obtained on Affymetrix HuGene ST 1.1 microarrays.
After reading in the CEL files, I pre-processed the GeneFeatureSet doing background correction, quantile normalisation, and summarisation to
gene-level using the rma function from the oligo package.
# What my real data looks like:
( cels <- read.celfiles(pathToFiles) ) # generates a GeneFeatureSet
GeneFeatureSet (storageMode: lockedEnvironment)
assayData: 1178100 features, 90 samples
element names: exprs
protocolData
rowNames: 75_459_CD16.CEL 76_460_CD16.CEL ... triad0078_H7_498_CD16.CEL (90 total)
varLabels: exprs dates
varMetadata: labelDescription channel
phenoData
rowNames: 75_459_CD16.CEL 76_460_CD16.CEL ... triad0078_H7_498_CD16.CEL (90 total)
varLabels: cell_type info.batch.name ... age (29 total)
varMetadata: labelDescription channel
featureData: none
experimentData: use 'experimentData(object)'
Annotation: pd.hugene.1.1.st.v1
#Example workfow using oligoData for reproducibilty:
library(oligo)
library(oligoData)
data(affyGeneFS)
affyGeneFS # a GeneFeatureSet
GeneFeatureSet (storageMode: lockedEnvironment)
assayData: 1102500 features, 33 samples
element names: exprs
protocolData
rowNames: TisMap_Brain_01_v1_WTGene1.CEL TisMap_Brain_02_v1_WTGene1.CEL ... TisMap_Thyroid_03_v1_WTGene1.CEL (33 total)
varLabels: exprs dates
varMetadata: labelDescription channel
phenoData
rowNames: TisMap_Brain_01_v1_WTGene1.CEL TisMap_Brain_02_v1_WTGene1.CEL ... TisMap_Thyroid_03_v1_WTGene1.CEL (33 total)
varLabels: index
varMetadata: labelDescription channel
featureData: none
experimentData: use 'experimentData(object)'
Annotation: pd.hugene.1.0.st.v1
# generate gene-level expression values:
eset <- rma(affyGeneFS)
# equivalent to:
bgCorrected <- backgroundCorrect(affyGeneFS) # Background correct
normalized <- normalize(bgCorrected, method="quantile") # Quantile normalise
eset2 <- rma(normalized, background=F, normalize=F, subset=NULL) # Summarize with median polish
I would like to re-run the eQTL analysis, this time excluding probes from the expression data that contain
SNPs in >1% of CEU individuals to see whether this has any substantial effect on my findings. My concern is
the possibility of apparent eQTLs which are in fact artefacts due to less efficient binding between probes
and transcripts containing the minor allele.
My questions are-
1) Having obtained a list of the probes to exclude, is it valid to simply perform BG correction and quantile normalisation on the whole GeneFeatureSet,
but to then remove probes containing SNPs prior to summarisation? Would the validity of mapping the resulting probesets to Entrez ids etc.
be questionable if some probes had been excluded?
2) If such an approach is reasonable, how can I map Affymetrix probe-level identifiers to the GeneFeatureSet?
# I know how to get feature ids for PM probes and their groupings into gene-level probesets:
featureInfo <- oligo:::stArrayPmInfo(normalized, target = 'core')
head(featureInfo,10)
fid fsetid
1 116371 7892501
2 943979 7892501
3 493089 7892501
4 907039 7892501
5 1033309 7892502
6 653512 7892502
7 690769 7892502
8 997409 7892502
9 379979 7892503
10 469846 7892503
The fsetid s correspond to gene-level probesets. Are the fid s row indices in the GeneFeatureSet,
and how can they be linked to Affy probe level ids?
Assuming I can map Affy probe ids to the fid column, here is my proposed solution for
summarisation after removal of selected probes, but I wonder if there is a more straightforward way?
# for demo purposes generate a random list of fids to remove
# some fid s are duplicated in the data.frame as they map to multiple probesets, so use 'unique'
set.seed(1234)
fidToCut <- sample(unique(featureInfo$fid), 10000, replace=F)
# remove rows from GeneFeatureSet corressponding to fid s we want to exclude
normalized2 <- normalized[-fidToCut, ]
# create mapping for the new GeneFeatureSet of how old row indices map to new indices
map <- cbind(new_ind= 1:nrow(normalized2), old_ind= featureNames(normalized2) )
# pmi gives row indices of pm probes in 'normalized', not 'normalized2'
pmi <- featureInfo[["fid"]]
# re-order dataframe so 'old_ind' column is the same as pmi
map1 <- map[match(pmi, map[,"old_ind"]),]
map1 <- cbind(map1, featureInfo)
tail(map1)
new_ind old_ind fid fsetid
861488 961136 969962 969962 8180418
861489 208831 210719 210719 8180418
861490 870109 878113 878113 8180418
861491 598190 603636 603636 8180418
861492 858752 866642 866642 8180418
861493 713834 720353 720353 8180418
# we need to get rid of rows with NA... e.g. where 'fid' but no 'new_ind' exists:
# effectively recreating a new 'featureInfo' dataframe
map1 <- na.omit(map1)
# now we can get the new row indices, and the probesets they belong to
pmi <- as.numeric( as.character( map1[["new_ind"]] )) # 'new_ind' got coerced to a factor
pnVec <- as.character(map1[["fsetid"]])
# subset the normalized probe-level values to keep probes by indexes in pmi
pms <- exprs(normalized2)[pmi, , drop = FALSE]
dimnames(pms) <- NULL
colnames(pms) <- sampleNames(normalized2)
# get a matrix of summarized values, which can be used to create an ExpressionSet
theExprs <- basicRMA(pms, pnVec, normalize=F, background=F)
I'd greatly appreciate any advice. I'm a biologist by background so apologies if this is rather basic.
Thanks very much
Jimmy Peters, Smith Lab, Cambridge Institute for Medical Research.
-- output of sessionInfo():
R version 3.0.2 (2013-09-25)
Platform: x86_64-apple-darwin10.8.0 (64-bit)
locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] pd.hugene.1.0.st.v1_3.8.0 oligoData_1.8.0 biomaRt_2.18.0 pd.hugene.1.1.st.v1_3.8.0 RSQLite_0.11.4 DBI_0.2-7
[7] oligo_1.26.0 Biobase_2.22.0 oligoClasses_1.24.0 BiocGenerics_0.8.0
loaded via a namespace (and not attached):
[1] affxparser_1.34.0 affyio_1.30.0 BiocInstaller_1.12.0 Biostrings_2.30.0 bit_1.1-10 codetools_0.2-8 ff_2.2-12
[8] foreach_1.4.1 GenomicRanges_1.14.1 IRanges_1.20.0 iterators_1.0.6 preprocessCore_1.24.0 RCurl_1.95-4.1 splines_3.0.2
[15] stats4_3.0.2 tools_3.0.2 XML_3.95-0.2 XVector_0.2.0 zlibbioc_1.8.0
--
Sent via the guest posting facility at bioconductor.org.
More information about the Bioconductor
mailing list