[BioC] biomart doesn't annotate all the genes queried -CEACAM6

Wed Sep 17 03:33:53 CEST 2008

Hi marc,

thanks. I'm using the hgu133plus2.db package as a control to the biomaRt package.

I think biomaRt has plenty of potential and gives users tremendous power in mapping across different organisms, databases etc. I'm quite certain I'll be using some of its functionality in future.

regards

julian

----- Original Message -----
From: "Marc Carlson" <mcarlson at fhcrc.org>
To: "Julian Lee" <julian at omniarray.com>
Cc: bioconductor at stat.math.ethz.ch
Sent: Tuesday, September 16, 2008 11:45:15 PM GMT +08:00 Beijing / Chongqing / Hong Kong / Urumqi
Subject: Re: [BioC] biomart doesn't annotate all the genes queried -CEACAM6

Hi Julian,

You could also use either of the following two standard annotation 
packages for this: org.Hs.eg.db or hgu133plus2.db.

It appears that the only field you are looking for that we don't have 
yet, is the chromosome end position, and even that should be available 
within days inside of the latest devel release. This won't help with 
your biomaRt question (and I will leave this question for the true 
biomaRt experts), but is always good to have multiple options. ;)

Marc

Julian Lee wrote:
> Dear all,
>
> I'm having some problems trying to annotate some genes, eg. CEACAM6.
>
> My problem is as follows, 
> Query  input = chromosome numbers 1:22, 
> Output attributes = ensembl_gene_id, unigene, chromosome_number,start_position,end_position,hgnc_symbol
>
>
> ##Code
>   
>> require(biomaRt)
>> ensembl<-useMart('ensembl')
>> ensembl<-useDataset('hsapiens_gene_ensembl',mart=ensembl)
>> ensembl
>>     
> Object of class 'Mart':
>  Using the ensembl BioMart database
>  Using the hsapiens_gene_ensembl dataset
>
>
> ##Build Attributes of Interest
> a<-c('ensembl_gene_id','unigene','illumina_v2','affy_hg_u133_plus_2','hgnc_symbol','chromosome_name','start_position','end_position')
>   
>> a
>>     
> [1] "ensembl_gene_id"    "unigene"             "illumina_v2"        
> [4] "affy_hg_u133_plus_2" "hgnc_symbol"         "chromosome_name"    
> [7] "start_position"      "end_position"   
>
> ##Retrieving chromosome 1:22 from biomart
> getBM(attributes=a,filters='chromosome_name',values=1:22,mart=ensembl,verbose=T)->mydataset
> <?xml version='1.0' encoding='UTF-8'?><!DOCTYPE Query><Query  virtualSchemaName = 'default' uniqueRows = '1' count = '0' datasetConfigVersion = '0.6' requestid= "biomaRt"> <Dataset name = 'hsapiens_gene_ensembl'><Attribute name = 'ensembl_gene_id'/><Attribute name = 'unigene'/><Attribute name = 'illumina_v2'/><Attribute name = 'affy_hg_u133_plus_2'/><Attribute name = 'hgnc_symbol'/><Attribute name = 'chromosome_name'/><Attribute name = 'start_position'/><Attribute name = 'end_position'/><Filter name = 'chromosome_name' value = '1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22' /></Dataset></Query>
>
> ##I can't find CEACAM6 in mydataset
>   
>> grep('CEACAM6',mydataset$hgnc_symbol)
>>     
> integer(0)
>
>   
>> grep(c('203757_s_at','211657_at'),mydataset$affy_hg_u133_plus_2)
>>     
> integer(0)
>
>
> ##and the number of affy probes doesn't match to a u133plus2 chip (54,000 probes)
>   
>> length(unique(mydataset$affy_hg_u133_plus_2))
>>     
> [1] 23474
>
> ##However, if i'm looking for CEACAM6 using the affy probes, i can find it, 
>   
>> getBM(attributes=a,filters='affy_hg_u133_plus_2',values=c('203757_s_at','211657_at'),mart=ensembl)
>>     
>   ensembl_gene_id   unigene illumina_v2 affy_hg_u133_plus_2 hgnc_symbol
> 1 ENSG00000086548 Hs.602441  ILMN_21866         203757_s_at     CEACAM6
> 2 ENSG00000086548 Hs.466814  ILMN_21866         203757_s_at     CEACAM6
> 3 ENSG00000086548 Hs.602441  ILMN_21866           211657_at     CEACAM6
> 4 ENSG00000086548 Hs.466814  ILMN_21866           211657_at     CEACAM6
>   chromosome_name start_position end_position
> 1              19       46951341     46967953
> 2              19       46951341     46967953
> 3              19       46951341     46967953
> 4              19       46951341     46967953
> ##End Code
>
> I'm not too sure what's going on. Why is it when queried with chromosome numbers, CEACAM6 disappears, but when queried with affy_hg_u133_plus_2 probes, it appears.
>
> Any help on this would be great. thanks. 
>
> regards
>
> btw- i couldn't find EGFR. as a control, i managed to identify TP53
>
> ##R Code
>   
>> grep('EGFR',mydataset$hgnc_symbol)
>>     
> integer(0)
>   
>> mydataset[grep('TP53',mydataset$hgnc_symbol),'hgnc_symbol']
>>     
>  [1] "TP53I13"  "TP53I13"  "TP53AP1"  "TP53AP1"  "TP53AP1"  "TP53AP1" 
>  [7] "TP53BP1"  "TP53BP1"  "TP53BP1"  "TP53BP1"  "TP53BP1"  "TP53I3"  
> [13] "TP53I3"   "TP53I3"   "TP53I3"   "TP53I3"   "TP53"     "TP53"    
> [19] "TP53"     "TP53INP2" "TP53INP2" "TP53INP2" "TP53BP2"  "TP53BP2" 
> [25] "TP53BP2"  "TP53BP2" 
> ##end R Code
>
>
>   
>> sessionInfo()
>>     
> R version 2.7.1 (2008-06-23) 
> i486-pc-linux-gnu 
>
> locale:
> LC_CTYPE=en_SG.UTF-8;LC_NUMERIC=C;LC_TIME=en_SG.UTF-8;LC_COLLATE=en_SG.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_SG.UTF-8;LC_PAPER=en_SG.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_SG.UTF-8;LC_IDENTIFICATION=C
>
> attached base packages:
> [1] tools     stats     graphics  grDevices utils     datasets  methods  
> [8] base     
>
> other attached packages:
> [1] hgu133plus2.db_2.2.0            illuminaHumanv2ProbeID.db_1.1.1
> [3] AnnotationDbi_1.2.2             RSQLite_0.6-9                  
> [5] DBI_0.2-4                       Biobase_2.0.1                  
> [7] biomaRt_1.14.1                  RCurl_0.9-4                    
>
> loaded via a namespace (and not attached):
> [1] XML_1.96-0
>
>
>
>
>
>
>
>
>
>   

-- 
Julian Lee
Bioinformatics Specialist
Cellular and Molecular Research
National Cancer Center Singapore