[BioC] matching Entrez-IDs to Affy probesets using biomaRt
Marc Carlson
mcarlson at fhcrc.org
Fri Mar 14 21:48:56 CET 2014
Hi Naomi,
I don't have an answer for what biomaRt is doing here (although I bet
that they will have some kind of explanation). But if you just need to
do some quick annotation, there is also a Bioconductor package for that
platform that you can use, called 'rat2302.db':
library(rat2302.db)
length(keys(rat2302.db, keytype='PROBEID'))
This shows that it has 31099 probeset IDs.
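To confirm that the package covers the same probesets as the array, the two ID vectors can be compared directly. Here is a minimal sketch in base R, using made-up toy vectors: `array_ids` and `package_ids` are hypothetical stand-ins for featureNames(CELdata) and keys(rat2302.db, keytype='PROBEID'), and the IDs themselves are invented for illustration.

```r
# Toy stand-ins (hypothetical values) for the real ID vectors:
#   array_ids   <- featureNames(CELdata)
#   package_ids <- keys(rat2302.db, keytype = 'PROBEID')
array_ids   <- c("p1_at", "p2_at", "p3_at", "p4_at")
package_ids <- c("p4_at", "p3_at", "p2_at", "p1_at")

# IDs present on the array but absent from the annotation package
missing_from_package <- setdiff(array_ids, package_ids)

# TRUE only if both vectors contain exactly the same set of IDs
# (order and duplicates are ignored)
same_ids <- setequal(array_ids, package_ids)

length(missing_from_package)
same_ids
```

If `same_ids` is TRUE on the real vectors, every probeset on the array has a key in the package, although a key can still lack a gene annotation.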
Then to annotate some probes you could do it like this:
probes <- head(keys(rat2302.db, keytype='PROBEID'))
select(rat2302.db, keys=probes, columns=c('SYMBOL','GENENAME'),
       keytype='PROBEID')
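Note that select() returns one row per probe/ID pair, so a probe that maps to several genes appears on several rows, and a probe with no mapping comes back with NA. A small sketch of how these cases could be counted, using a hand-made data frame: `res` and all of its values are hypothetical, and in practice it might come from something like select(rat2302.db, keys=probes, columns='ENTREZID', keytype='PROBEID').

```r
# Hypothetical stand-in for the data frame returned by select()
res <- data.frame(
  PROBEID  = c("p1_at", "p2_at", "p2_at", "p3_at", "p4_at"),
  ENTREZID = c("101",   "202",   "203",   NA,      "101"),
  stringsAsFactors = FALSE
)

# Number of rows: this counts probe/ID pairs, not probes
n_rows <- nrow(res)

# Number of distinct probes in the result
n_probes <- length(unique(res$PROBEID))

# Distinct probes that have at least one Entrez ID
n_annotated <- length(unique(res$PROBEID[!is.na(res$ENTREZID)]))

# Distinct probes with no Entrez ID at all
n_unannotated <- n_probes - n_annotated

# Number of distinct non-missing Entrez IDs per probe
ids_per_probe <- tapply(res$ENTREZID, res$PROBEID,
                        function(x) length(unique(x[!is.na(x)])))

# Probes that map to more than one Entrez ID
multi_mapped <- names(ids_per_probe)[ids_per_probe > 1]
```

The same distinction between row counts and unique-ID counts applies to the getBM() results quoted below, so comparing unique probesets rather than rows is probably the fairer way to judge coverage.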
And just in case you are currently only acclimated to biomaRt, you can
learn more about how to use this package here:
http://www.bioconductor.org/packages/devel/bioc/vignettes/AnnotationDbi/inst/doc/IntroToAnnotationPackages.pdf
Marc
On 03/13/2014 02:01 PM, Naomi Altman wrote:
> After my premature posting yesterday, I am a bit hesitant to ask, but I am
> puzzled by what I am getting from biomaRt. (To avoid clutter, I added
> the sessionInfo at the end of the message.)
>
> I used ReadAffy() to read in a rat dataset and called it CELdata.
>
> CELdata
> AffyBatch object
> size of arrays=834x834 features (19 kb)
> cdf=Rat230_2 (31099 affyids)
> number of samples=8
> number of genes=31099
> annotation=rat2302
> notes=
>
> features=featureNames(CELdata)
>> length(features)
> [1] 31099
>> sum(is.na(features))
> [1] 0
>
> I use features to query biomaRt for the Entrez-ids. I got back only 18882 rows (and actually fewer probesets, because some probesets are matched to 2 Entrez-ids). On the other hand, some of the Affy-ids that were returned did not match anything, so I am not sure why they were returned.
>
> matchFeature=getBM(attributes=c('affy_rat230_2','entrezgene'), filters ='affy_rat230_2', values = features, mart = ensembl)
>> dim(matchFeature)
> [1] 18882 2
>> sum(!is.na(matchFeature$affy_rat230_2))
> [1] 18882
>> sum(!is.na(matchFeature$entrezgene))
> [1] 17814
>
>
> I then use the non-missing Entrez-ids to query biomaRt for the Affy-ids. I got back only 18249 rows (presumably because some Entrez-ids are matched to 2 probesets). Nothing is missing.
>
>
>
> matchEntrez=getBM(attributes=c('affy_rat230_2','entrezgene'), filters ='entrezgene', values = matchFeature[!is.na(matchFeature[,2]),2], mart = ensembl)
>
>> dim(matchEntrez)
> [1] 18249 2
>> sum(!is.na(matchEntrez[,1]))
> [1] 18249
>> sum(!is.na(matchEntrez[,2]))
> [1] 18249
>
>
> I am pretty sure that the discrepancies in the counts have to do with
> how getBM is handling multiple matches.
>
> length(unique(matchFeature[,1]))
> [1] 16851
>> length(unique(matchEntrez[,1]))
> [1] 16143
>> length(unique(matchFeature[,2]))
> [1] 13738
>> length(unique(matchEntrez[,2]))
> [1] 13737
>> length(unique(matchFeature[!is.na(matchFeature[,2]),1]))
> [1] 16142
>
>
>
> In any case, I seem to be missing about 13000 probesets. Surely there
> cannot be that many probesets on the array with no Entrez-id?
>
> Thanks for any help you can provide.
>
> Naomi Altman
>
>
>> sessionInfo()
> R version 3.0.2 (2013-09-25)
> Platform: x86_64-w64-mingw32/x64 (64-bit)
>
> locale:
> [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
> [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
> [5] LC_TIME=English_United States.1252
>
> attached base packages:
> [1] parallel stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] rat2302cdf_2.13.0 hgu95av2cdf_2.13.0 AnnotationDbi_1.24.0 biomaRt_2.18.0
> [5] edgeR_3.4.2 limma_3.18.13 affy_1.40.0 Biobase_2.22.0
> [9] BiocGenerics_0.8.0
>
> loaded via a namespace (and not attached):
> [1] affyio_1.30.0 BiocInstaller_1.12.0 DBI_0.2-7 IRanges_1.20.7
> [5] preprocessCore_1.24.0 RCurl_1.95-4.1 RSQLite_0.11.4 stats4_3.0.2
> [9] tools_3.0.2 XML_3.98-1.1 zlibbioc_1.8.0
>