[BioC] biomart doesn't annotate all the genes queried -CEACAM6
Julian Lee
julian at omniarray.com
Tue Sep 16 11:21:32 CEST 2008
Dear all,
I'm having some problems trying to annotate some genes, e.g. CEACAM6.
My problem is as follows:
Query input = chromosome names 1:22 (filter chromosome_name),
Output attributes = ensembl_gene_id, unigene, illumina_v2, affy_hg_u133_plus_2, hgnc_symbol, chromosome_name, start_position, end_position
##Code
> require(biomaRt)
> ensembl <- useMart('ensembl')
> ensembl <- useDataset('hsapiens_gene_ensembl', mart=ensembl)
> ensembl
Object of class 'Mart':
Using the ensembl BioMart database
Using the hsapiens_gene_ensembl dataset
##Build Attributes of Interest
> a <- c('ensembl_gene_id','unigene','illumina_v2','affy_hg_u133_plus_2','hgnc_symbol','chromosome_name','start_position','end_position')
> a
[1] "ensembl_gene_id" "unigene" "illumina_v2"
[4] "affy_hg_u133_plus_2" "hgnc_symbol" "chromosome_name"
[7] "start_position" "end_position"
##Retrieving chromosome 1:22 from biomart
> mydataset <- getBM(attributes=a, filters='chromosome_name', values=1:22, mart=ensembl, verbose=TRUE)
<?xml version='1.0' encoding='UTF-8'?><!DOCTYPE Query><Query virtualSchemaName = 'default' uniqueRows = '1' count = '0' datasetConfigVersion = '0.6' requestid= "biomaRt"> <Dataset name = 'hsapiens_gene_ensembl'><Attribute name = 'ensembl_gene_id'/><Attribute name = 'unigene'/><Attribute name = 'illumina_v2'/><Attribute name = 'affy_hg_u133_plus_2'/><Attribute name = 'hgnc_symbol'/><Attribute name = 'chromosome_name'/><Attribute name = 'start_position'/><Attribute name = 'end_position'/><Filter name = 'chromosome_name' value = '1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22' /></Dataset></Query>
##I can't find CEACAM6 in mydataset
> grep('CEACAM6', mydataset$hgnc_symbol)
integer(0)
> grep(c('203757_s_at','211657_at'),mydataset$affy_hg_u133_plus_2)
integer(0)
##Also, the number of unique affy probes returned doesn't match the number on a U133 Plus 2 chip (about 54,000 probes)
> length(unique(mydataset$affy_hg_u133_plus_2))
[1] 23474
##However, if I query for CEACAM6 using the affy probes directly, I can find it:
> getBM(attributes=a,filters='affy_hg_u133_plus_2',values=c('203757_s_at','211657_at'),mart=ensembl)
  ensembl_gene_id   unigene illumina_v2 affy_hg_u133_plus_2 hgnc_symbol
1 ENSG00000086548 Hs.602441  ILMN_21866         203757_s_at     CEACAM6
2 ENSG00000086548 Hs.466814  ILMN_21866         203757_s_at     CEACAM6
3 ENSG00000086548 Hs.602441  ILMN_21866           211657_at     CEACAM6
4 ENSG00000086548 Hs.466814  ILMN_21866           211657_at     CEACAM6
  chromosome_name start_position end_position
1              19       46951341     46967953
2              19       46951341     46967953
3              19       46951341     46967953
4              19       46951341     46967953
##End Code
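A note on the probe lookup above: grep() uses only the first element of a pattern vector (R gives a warning), so the call with two probe IDs really searches for '203757_s_at' alone. A minimal sketch of checking several probe IDs at once with %in%, using a small mock data frame in place of mydataset (the placeholder probe IDs and gene names are made up for illustration; only the two CEACAM6 probes come from the query above):

```r
# Mock stand-in for `mydataset`; rows are illustrative, not real BioMart output.
mock <- data.frame(
  affy_hg_u133_plus_2 = c("placeholder_1_at", "211657_at", NA, "203757_s_at"),
  hgnc_symbol         = c("GENE_A", "CEACAM6", "GENE_B", "CEACAM6"),
  stringsAsFactors    = FALSE
)

probes <- c("203757_s_at", "211657_at")

# %in% compares every row against the whole probe vector (NA counts as no match)
hits <- which(mock$affy_hg_u133_plus_2 %in% probes)
hits

# Rows for the matching probes
mock[hits, ]
```

With the real result, the same check would be which(mydataset$affy_hg_u133_plus_2 %in% probes).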
I'm not too sure what's going on. Why does CEACAM6 disappear when I query by chromosome, but appear when I query by affy_hg_u133_plus_2 probes?
Any help on this would be great. Thanks.
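One sanity check that might help narrow this down is to see whether the chromosome query returned rows for chromosome 19 at all, and whether any returned row overlaps the coordinates reported for CEACAM6 (19:46951341-46967953). The sketch below runs on a mock data frame with made-up gene names; it only illustrates the check and does not show what the real query returns.

```r
# Mock stand-in for `mydataset`; gene names and most coordinates are made up.
mock <- data.frame(
  hgnc_symbol      = c("GENE_A", "GENE_B", "GENE_C"),
  chromosome_name  = c("1", "19", "19"),
  start_position   = c(1000, 46000000, 46951341),
  end_position     = c(2000, 46010000, 46967953),
  stringsAsFactors = FALSE
)

# How many rows came back per chromosome?
rows_per_chr <- table(mock$chromosome_name)
rows_per_chr

# Any row on chromosome 19 overlapping the interval reported for CEACAM6?
in_region <- subset(
  mock,
  as.character(chromosome_name) == "19" &
    start_position <= 46967953 &
    end_position   >= 46951341
)
in_region
```

If the real table has no row in that interval, the whole gene is missing from the result rather than just its HGNC symbol.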
Regards
BTW, I couldn't find EGFR either. As a control, I managed to identify TP53.
##R Code
> grep('EGFR',mydataset$hgnc_symbol)
integer(0)
> mydataset[grep('TP53',mydataset$hgnc_symbol),'hgnc_symbol']
[1] "TP53I13" "TP53I13" "TP53AP1" "TP53AP1" "TP53AP1" "TP53AP1"
[7] "TP53BP1" "TP53BP1" "TP53BP1" "TP53BP1" "TP53BP1" "TP53I3"
[13] "TP53I3" "TP53I3" "TP53I3" "TP53I3" "TP53" "TP53"
[19] "TP53" "TP53INP2" "TP53INP2" "TP53INP2" "TP53BP2" "TP53BP2"
[25] "TP53BP2" "TP53BP2"
##end R Code
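Since grep() matches substrings, the TP53 control above also picks up TP53I13, TP53BP1 and so on. A small sketch of an exact-symbol lookup, run on a vector of symbols taken from the output above rather than on the full mydataset:

```r
# A few of the symbols from the grep('TP53', ...) output above
symbols <- c("TP53I13", "TP53AP1", "TP53BP1", "TP53I3",
             "TP53", "TP53", "TP53INP2", "TP53BP2")

# Substring match: every symbol containing "TP53"
substring_hits <- grep("TP53", symbols)

# Exact match: only rows whose symbol is exactly "TP53"
exact_hits <- which(symbols == "TP53")

# Same result with an anchored regular expression
anchored_hits <- grep("^TP53$", symbols)

length(substring_hits)
exact_hits
```

The same form, which(mydataset$hgnc_symbol == 'EGFR'), gives an exact check for EGFR.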
> sessionInfo()
R version 2.7.1 (2008-06-23)
i486-pc-linux-gnu
locale:
LC_CTYPE=en_SG.UTF-8;LC_NUMERIC=C;LC_TIME=en_SG.UTF-8;LC_COLLATE=en_SG.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_SG.UTF-8;LC_PAPER=en_SG.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_SG.UTF-8;LC_IDENTIFICATION=C
attached base packages:
[1] tools stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] hgu133plus2.db_2.2.0 illuminaHumanv2ProbeID.db_1.1.1
[3] AnnotationDbi_1.2.2 RSQLite_0.6-9
[5] DBI_0.2-4 Biobase_2.0.1
[7] biomaRt_1.14.1 RCurl_0.9-4
loaded via a namespace (and not attached):
[1] XML_1.96-0
--
Julian Lee
Bioinformatics Specialist
Cellular and Molecular Research
National Cancer Center Singapore