[BioC] biomart doesn't annotate all the genes queried -CEACAM6
Julian Lee
julian at omniarray.com
Wed Sep 17 03:33:53 CEST 2008
Hi Marc,
Thanks. I'm using the hgu133plus2.db package as a control for the biomaRt package.
I think biomaRt has plenty of potential and gives users tremendous power in mapping across different organisms, databases, etc. I'm quite certain I'll be using some of its functionality in the future.
Regards,
Julian
----- Original Message -----
From: "Marc Carlson" <mcarlson at fhcrc.org>
To: "Julian Lee" <julian at omniarray.com>
Cc: bioconductor at stat.math.ethz.ch
Sent: Tuesday, September 16, 2008 11:45:15 PM GMT +08:00 Beijing / Chongqing / Hong Kong / Urumqi
Subject: Re: [BioC] biomart doesn't annotate all the genes queried -CEACAM6
Hi Julian,
You could also use either of the following two standard annotation
packages for this: org.Hs.eg.db or hgu133plus2.db.
It appears that the only field you are looking for that we don't have
yet is the chromosome end position, and even that should be available
within days in the latest devel release. This won't help with
your biomaRt question (I will leave that for the true
biomaRt experts), but it is always good to have multiple options. ;)
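As a rough sketch of the annotation-package route, assuming hgu133plus2.db is installed and that its hgu133plus2SYMBOL map covers the probe sets of interest (the two probe IDs are the ones from the original post; the guard around the lookup is only there so the snippet runs without the package):

```r
# Probe set IDs taken from the original post.
probes <- c("203757_s_at", "211657_at")

# Assumption: hgu133plus2.db is installed; otherwise the lookup is skipped.
if (requireNamespace("hgu133plus2.db", quietly = TRUE)) {
  library(hgu133plus2.db)
  # mget() on the SYMBOL map returns a named list of gene symbols per probe set;
  # ifnotfound = NA avoids an error for IDs that are missing from the map.
  symbols <- unlist(mget(probes, hgu133plus2SYMBOL, ifnotfound = NA))
  print(symbols)
} else {
  message("hgu133plus2.db is not installed; skipping the lookup")
}
```

The same pattern should work with the other maps in the package, but which maps are available depends on the package version.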
Marc
Julian Lee wrote:
> Dear all,
>
> I'm having some problems trying to annotate some genes, e.g. CEACAM6.
>
> My problem is as follows:
> Query input = chromosomes 1:22
> Output attributes = ensembl_gene_id, unigene, illumina_v2, affy_hg_u133_plus_2, hgnc_symbol, chromosome_name, start_position, end_position
>
>
> ##Code
>
> > require(biomaRt)
> > ensembl<-useMart('ensembl')
> > ensembl<-useDataset('hsapiens_gene_ensembl',mart=ensembl)
> > ensembl
> Object of class 'Mart':
> Using the ensembl BioMart database
> Using the hsapiens_gene_ensembl dataset
>
>
> ##Build Attributes of Interest
> a<-c('ensembl_gene_id','unigene','illumina_v2','affy_hg_u133_plus_2','hgnc_symbol','chromosome_name','start_position','end_position')
>
> > a
> [1] "ensembl_gene_id"     "unigene"             "illumina_v2"
> [4] "affy_hg_u133_plus_2" "hgnc_symbol"         "chromosome_name"
> [7] "start_position"      "end_position"
>
> ##Retrieving chromosome 1:22 from biomart
> getBM(attributes=a,filters='chromosome_name',values=1:22,mart=ensembl,verbose=T)->mydataset
> <?xml version='1.0' encoding='UTF-8'?><!DOCTYPE Query><Query virtualSchemaName = 'default' uniqueRows = '1' count = '0' datasetConfigVersion = '0.6' requestid= "biomaRt"> <Dataset name = 'hsapiens_gene_ensembl'><Attribute name = 'ensembl_gene_id'/><Attribute name = 'unigene'/><Attribute name = 'illumina_v2'/><Attribute name = 'affy_hg_u133_plus_2'/><Attribute name = 'hgnc_symbol'/><Attribute name = 'chromosome_name'/><Attribute name = 'start_position'/><Attribute name = 'end_position'/><Filter name = 'chromosome_name' value = '1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22' /></Dataset></Query>
>
> ##I can't find CEACAM6 in mydataset
>
> > grep('CEACAM6',mydataset$hgnc_symbol)
> integer(0)
>
> > grep(c('203757_s_at','211657_at'),mydataset$affy_hg_u133_plus_2)
> integer(0)
>
>
> ##and the number of affy probe sets doesn't match a u133plus2 chip (about 54,000 probe sets)
>
> > length(unique(mydataset$affy_hg_u133_plus_2))
> [1] 23474
>
> ##However, if I query for CEACAM6 using the affy probes, I can find it:
>
> > getBM(attributes=a,filters='affy_hg_u133_plus_2',values=c('203757_s_at','211657_at'),mart=ensembl)
>   ensembl_gene_id   unigene illumina_v2 affy_hg_u133_plus_2 hgnc_symbol
> 1 ENSG00000086548 Hs.602441  ILMN_21866         203757_s_at     CEACAM6
> 2 ENSG00000086548 Hs.466814  ILMN_21866         203757_s_at     CEACAM6
> 3 ENSG00000086548 Hs.602441  ILMN_21866           211657_at     CEACAM6
> 4 ENSG00000086548 Hs.466814  ILMN_21866           211657_at     CEACAM6
>   chromosome_name start_position end_position
> 1              19       46951341     46967953
> 2              19       46951341     46967953
> 3              19       46951341     46967953
> 4              19       46951341     46967953
> ##End Code
>
> I'm not sure what's going on. Why does CEACAM6 disappear when I query by chromosome, but appear when I query with affy_hg_u133_plus_2 probes?
>
> Any help on this would be great. Thanks.
>
> Regards
>
> BTW, I couldn't find EGFR either. As a control, I managed to find TP53.
>
> ##R Code
>
> > grep('EGFR',mydataset$hgnc_symbol)
> integer(0)
>
> > mydataset[grep('TP53',mydataset$hgnc_symbol),'hgnc_symbol']
> [1] "TP53I13" "TP53I13" "TP53AP1" "TP53AP1" "TP53AP1" "TP53AP1"
> [7] "TP53BP1" "TP53BP1" "TP53BP1" "TP53BP1" "TP53BP1" "TP53I3"
> [13] "TP53I3" "TP53I3" "TP53I3" "TP53I3" "TP53" "TP53"
> [19] "TP53" "TP53INP2" "TP53INP2" "TP53INP2" "TP53BP2" "TP53BP2"
> [25] "TP53BP2" "TP53BP2"
> ##end R Code
>
>
>
> > sessionInfo()
> R version 2.7.1 (2008-06-23)
> i486-pc-linux-gnu
>
> locale:
> LC_CTYPE=en_SG.UTF-8;LC_NUMERIC=C;LC_TIME=en_SG.UTF-8;LC_COLLATE=en_SG.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_SG.UTF-8;LC_PAPER=en_SG.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_SG.UTF-8;LC_IDENTIFICATION=C
>
> attached base packages:
> [1] tools stats graphics grDevices utils datasets methods
> [8] base
>
> other attached packages:
> [1] hgu133plus2.db_2.2.0 illuminaHumanv2ProbeID.db_1.1.1
> [3] AnnotationDbi_1.2.2 RSQLite_0.6-9
> [5] DBI_0.2-4 Biobase_2.0.1
> [7] biomaRt_1.14.1 RCurl_0.9-4
>
> loaded via a namespace (and not attached):
> [1] XML_1.96-0
>
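A note on the search steps quoted above: base R's grep() takes a single pattern, so when it is given a vector of two probe IDs it uses only the first one (with a warning), and an unanchored pattern such as 'TP53' also matches longer symbols like TP53BP1. This does not explain why the chromosome query lacks CEACAM6, but exact matching makes the check unambiguous. A minimal sketch, using a small made-up data frame (toy_dataset, its rows, and the "000001_at"/"000002_at" IDs are hypothetical stand-ins for mydataset, not real biomaRt output):

```r
# Hypothetical stand-in for a few rows of `mydataset`; values are illustrative only.
toy_dataset <- data.frame(
  hgnc_symbol = c("TP53", "TP53BP1", "CEACAM6", "CEACAM6"),
  affy_hg_u133_plus_2 = c("000001_at", "000002_at", "203757_s_at", "211657_at"),
  stringsAsFactors = FALSE
)

# The two probe set IDs from the original post.
probes <- c("203757_s_at", "211657_at")

# Exact, vectorised membership test: row indices whose probe ID is in `probes`.
probe_rows <- which(toy_dataset$affy_hg_u133_plus_2 %in% probes)

# Exact symbol match; unlike grep('TP53', ...), this does not pick up TP53BP1.
tp53_rows <- which(toy_dataset$hgnc_symbol == "TP53")

# Anchored regular expression, if grep() is preferred.
tp53_rows_grep <- grep("^TP53$", toy_dataset$hgnc_symbol)

print(toy_dataset[probe_rows, ])
```

Applied to the real result, the same idea would be `mydataset$affy_hg_u133_plus_2 %in% c('203757_s_at','211657_at')`.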
--
Julian Lee
Bioinformatics Specialist
Cellular and Molecular Research
National Cancer Center Singapore