[BioC] biomart doesn't annotate all the genes queried -CEACAM6

Tue Sep 16 11:21:32 CEST 2008

Dear all,

I'm having some problems trying to annotate some genes, eg. CEACAM6.

My problem is as follows, 
Query  input = chromosome numbers 1:22, 
Output attributes = ensembl_gene_id, unigene, chromosome_number,start_position,end_position,hgnc_symbol

##Code
>require(biomaRt)
>ensembl<-useMart('ensembl')
>ensembl<-useDataset('hsapiens_gene_ensembl',mart=ensembl)
>ensembl
Object of class 'Mart':
 Using the ensembl BioMart database
 Using the hsapiens_gene_ensembl dataset

##Build Attributes of Interest
a<-c('ensembl_gene_id','unigene','illumina_v2','affy_hg_u133_plus_2','hgnc_symbol','chromosome_name','start_position','end_position')
> a
[1] "ensembl_gene_id"    "unigene"             "illumina_v2"        
[4] "affy_hg_u133_plus_2" "hgnc_symbol"         "chromosome_name"    
[7] "start_position"      "end_position"   

##Retrieving chromosome 1:22 from biomart
getBM(attributes=a,filters='chromosome_name',values=1:22,mart=ensembl,verbose=T)->mydataset
<?xml version='1.0' encoding='UTF-8'?><!DOCTYPE Query><Query  virtualSchemaName = 'default' uniqueRows = '1' count = '0' datasetConfigVersion = '0.6' requestid= "biomaRt"> <Dataset name = 'hsapiens_gene_ensembl'><Attribute name = 'ensembl_gene_id'/><Attribute name = 'unigene'/><Attribute name = 'illumina_v2'/><Attribute name = 'affy_hg_u133_plus_2'/><Attribute name = 'hgnc_symbol'/><Attribute name = 'chromosome_name'/><Attribute name = 'start_position'/><Attribute name = 'end_position'/><Filter name = 'chromosome_name' value = '1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22' /></Dataset></Query>

##I can't find CEACAM6 in mydataset
>grep('CEACAM6',mydataset$hgnc_symbol)
integer(0)

> grep(c('203757_s_at','211657_at'),mydataset$affy_hg_u133_plus_2)
integer(0)

##and the number of affy probes doesn't match to a u133plus2 chip (54,000 probes)
>length(unique(mydataset$affy_hg_u133_plus_2))
[1] 23474

##However, if i'm looking for CEACAM6 using the affy probes, i can find it, 
> getBM(attributes=a,filters='affy_hg_u133_plus_2',values=c('203757_s_at','211657_at'),mart=ensembl)
  ensembl_gene_id   unigene illumina_v2 affy_hg_u133_plus_2 hgnc_symbol
1 ENSG00000086548 Hs.602441  ILMN_21866         203757_s_at     CEACAM6
2 ENSG00000086548 Hs.466814  ILMN_21866         203757_s_at     CEACAM6
3 ENSG00000086548 Hs.602441  ILMN_21866           211657_at     CEACAM6
4 ENSG00000086548 Hs.466814  ILMN_21866           211657_at     CEACAM6
  chromosome_name start_position end_position
1              19       46951341     46967953
2              19       46951341     46967953
3              19       46951341     46967953
4              19       46951341     46967953
##End Code

I'm not too sure what's going on. Why is it when queried with chromosome numbers, CEACAM6 disappears, but when queried with affy_hg_u133_plus_2 probes, it appears.

Any help on this would be great. thanks. 

regards

btw- i couldn't find EGFR. as a control, i managed to identify TP53

##R Code
> grep('EGFR',mydataset$hgnc_symbol)
integer(0)
> mydataset[grep('TP53',mydataset$hgnc_symbol),'hgnc_symbol']
 [1] "TP53I13"  "TP53I13"  "TP53AP1"  "TP53AP1"  "TP53AP1"  "TP53AP1" 
 [7] "TP53BP1"  "TP53BP1"  "TP53BP1"  "TP53BP1"  "TP53BP1"  "TP53I3"  
[13] "TP53I3"   "TP53I3"   "TP53I3"   "TP53I3"   "TP53"     "TP53"    
[19] "TP53"     "TP53INP2" "TP53INP2" "TP53INP2" "TP53BP2"  "TP53BP2" 
[25] "TP53BP2"  "TP53BP2" 
##end R Code

> sessionInfo()
R version 2.7.1 (2008-06-23) 
i486-pc-linux-gnu 

locale:
LC_CTYPE=en_SG.UTF-8;LC_NUMERIC=C;LC_TIME=en_SG.UTF-8;LC_COLLATE=en_SG.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_SG.UTF-8;LC_PAPER=en_SG.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_SG.UTF-8;LC_IDENTIFICATION=C

attached base packages:
[1] tools     stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] hgu133plus2.db_2.2.0            illuminaHumanv2ProbeID.db_1.1.1
[3] AnnotationDbi_1.2.2             RSQLite_0.6-9                  
[5] DBI_0.2-4                       Biobase_2.0.1                  
[7] biomaRt_1.14.1                  RCurl_0.9-4                    

loaded via a namespace (and not attached):
[1] XML_1.96-0

-- 
Julian Lee
Bioinformatics Specialist
Cellular and Molecular Research
National Cancer Center Singapore