[BioC] biomart doesn't annotate all the genes queried -CEACAM6
Marc Carlson
mcarlson at fhcrc.org
Tue Sep 16 17:45:15 CEST 2008
Hi Julian,
You could also use either of the following two standard annotation
packages for this: org.Hs.eg.db or hgu133plus2.db.
It appears that the only field you are looking for that we don't have
yet is the chromosome end position, and even that should be available
within days in the latest devel release. This won't help with
your biomaRt question (I will leave that to the true
biomaRt experts), but it is always good to have multiple options. ;)
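
As a rough sketch of what that could look like for the two probes in your mail: the example below assumes hgu133plus2.db is installed and uses its hgu133plus2SYMBOL map with mget(). The variable names are only illustrative, and the lookup is skipped if the package is missing.

```r
# Probe IDs taken from the original question
probes <- c("203757_s_at", "211657_at")

# Only attempt the lookup when the annotation package is available
if (requireNamespace("hgu133plus2.db", quietly = TRUE)) {
  library(hgu133plus2.db)
  # mget() returns a named list: probe ID -> gene symbol(s), NA if unmapped
  symbols <- mget(probes, hgu133plus2SYMBOL, ifnotfound = NA)
  print(symbols)
}
```
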
Marc
Julian Lee wrote:
> Dear all,
>
> I'm having some problems trying to annotate some genes, e.g. CEACAM6.
>
> My problem is as follows:
> Query input = chromosomes 1:22
> Output attributes = ensembl_gene_id, unigene, illumina_v2, affy_hg_u133_plus_2, hgnc_symbol, chromosome_name, start_position, end_position
>
>
> ##Code
>
>> require(biomaRt)
>> ensembl<-useMart('ensembl')
>> ensembl<-useDataset('hsapiens_gene_ensembl',mart=ensembl)
>> ensembl
>>
> Object of class 'Mart':
> Using the ensembl BioMart database
> Using the hsapiens_gene_ensembl dataset
>
>
> ##Build Attributes of Interest
> a<-c('ensembl_gene_id','unigene','illumina_v2','affy_hg_u133_plus_2','hgnc_symbol','chromosome_name','start_position','end_position')
>
>> a
>>
> [1] "ensembl_gene_id" "unigene" "illumina_v2"
> [4] "affy_hg_u133_plus_2" "hgnc_symbol" "chromosome_name"
> [7] "start_position" "end_position"
>
> ##Retrieving chromosome 1:22 from biomart
> getBM(attributes=a,filters='chromosome_name',values=1:22,mart=ensembl,verbose=T)->mydataset
> <?xml version='1.0' encoding='UTF-8'?><!DOCTYPE Query><Query virtualSchemaName = 'default' uniqueRows = '1' count = '0' datasetConfigVersion = '0.6' requestid= "biomaRt"> <Dataset name = 'hsapiens_gene_ensembl'><Attribute name = 'ensembl_gene_id'/><Attribute name = 'unigene'/><Attribute name = 'illumina_v2'/><Attribute name = 'affy_hg_u133_plus_2'/><Attribute name = 'hgnc_symbol'/><Attribute name = 'chromosome_name'/><Attribute name = 'start_position'/><Attribute name = 'end_position'/><Filter name = 'chromosome_name' value = '1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22' /></Dataset></Query>
>
> ##I can't find CEACAM6 in mydataset
>
>> grep('CEACAM6',mydataset$hgnc_symbol)
>>
> integer(0)
>
>
>> grep(c('203757_s_at','211657_at'),mydataset$affy_hg_u133_plus_2)
>>
> integer(0)
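
A side note on the grep() call above: when grep() is given a vector of patterns it uses only the first element (with a warning), so the second probe ID is never searched for. For exact ID lookups, %in% is usually the safer choice. A minimal sketch on a made-up probe column (the values below are hypothetical stand-ins for mydataset$affy_hg_u133_plus_2):

```r
# Toy stand-in for the probe column; values are illustrative only
probe_col <- c("1007_s_at", "211657_at", NA, "203757_s_at", "211657_at")
wanted <- c("203757_s_at", "211657_at")

# grep() silently falls back to the first pattern only (warning suppressed here)
hits_grep <- suppressWarnings(grep(wanted, probe_col))

# %in% checks exact membership against every wanted ID; which() drops NAs
hits_in <- which(probe_col %in% wanted)

print(hits_grep)
print(hits_in)
```

Here hits_grep only finds the row for the first probe, while hits_in finds the rows for both.
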
>
>
> ##and the number of affy probes doesn't match a u133plus2 chip (54,000 probes)
>
>> length(unique(mydataset$affy_hg_u133_plus_2))
>>
> [1] 23474
>
> ##However, if I query for CEACAM6 by its affy probes, I can find it:
>
>> getBM(attributes=a,filters='affy_hg_u133_plus_2',values=c('203757_s_at','211657_at'),mart=ensembl)
>>
>   ensembl_gene_id   unigene illumina_v2 affy_hg_u133_plus_2 hgnc_symbol
> 1 ENSG00000086548 Hs.602441  ILMN_21866         203757_s_at     CEACAM6
> 2 ENSG00000086548 Hs.466814  ILMN_21866         203757_s_at     CEACAM6
> 3 ENSG00000086548 Hs.602441  ILMN_21866           211657_at     CEACAM6
> 4 ENSG00000086548 Hs.466814  ILMN_21866           211657_at     CEACAM6
>   chromosome_name start_position end_position
> 1              19       46951341     46967953
> 2              19       46951341     46967953
> 3              19       46951341     46967953
> 4              19       46951341     46967953
> ##End Code
>
> I'm not sure what's going on. Why does CEACAM6 disappear when I query by chromosome number, but appear when I query by affy_hg_u133_plus_2 probes?
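
One way to narrow this down might be to run the same query one chromosome at a time and check whether CEACAM6 comes back for chromosome 19 alone. This is only a diagnostic sketch, not an explanation of the cause. query_by_chromosome is a hypothetical helper, and the usage lines reuse the ensembl and a objects from the code above.

```r
# Hypothetical helper: run one getBM() call per chromosome and stack the results.
# Assumes biomaRt is installed and 'mart' was set up with useMart()/useDataset().
query_by_chromosome <- function(mart, attrs, chromosomes = 1:22) {
  per_chr <- lapply(chromosomes, function(chr) {
    biomaRt::getBM(attributes = attrs,
                   filters = "chromosome_name",
                   values = chr,
                   mart = mart)
  })
  # Combine the per-chromosome data frames into one
  do.call(rbind, per_chr)
}

# Usage (needs network access, so left commented out):
# res <- query_by_chromosome(ensembl, a, chromosomes = 19)
# "CEACAM6" %in% res$hgnc_symbol
```
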
>
> Any help on this would be great. thanks.
>
> regards
>
> By the way, I couldn't find EGFR either. As a control, I did manage to find TP53:
>
> ##R Code
>
>> grep('EGFR',mydataset$hgnc_symbol)
>>
> integer(0)
>
>> mydataset[grep('TP53',mydataset$hgnc_symbol),'hgnc_symbol']
>>
> [1] "TP53I13" "TP53I13" "TP53AP1" "TP53AP1" "TP53AP1" "TP53AP1"
> [7] "TP53BP1" "TP53BP1" "TP53BP1" "TP53BP1" "TP53BP1" "TP53I3"
> [13] "TP53I3" "TP53I3" "TP53I3" "TP53I3" "TP53" "TP53"
> [19] "TP53" "TP53INP2" "TP53INP2" "TP53INP2" "TP53BP2" "TP53BP2"
> [25] "TP53BP2" "TP53BP2"
> ##end R Code
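
Note that grep('TP53', ...) does substring matching, which is why TP53I13, TP53BP1 and the others show up next to TP53 itself. If only the exact symbol is wanted, a comparison with == or an anchored pattern is one option. A small sketch on a made-up symbol vector (hypothetical values, standing in for mydataset$hgnc_symbol):

```r
# Toy stand-in for the hgnc_symbol column; values are illustrative only
symbols <- c("TP53I13", "TP53AP1", "TP53BP1", "TP53", "TP53", "TP53INP2", NA)

# Substring match: every symbol containing "TP53"
substring_hits <- grep("TP53", symbols)

# Exact match: which() drops the NA produced by comparing against NA
exact_hits <- which(symbols == "TP53")

# Same result with an anchored regular expression
anchored_hits <- grep("^TP53$", symbols)

print(symbols[substring_hits])
print(symbols[exact_hits])
```
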
>
>
>
>> sessionInfo()
>>
> R version 2.7.1 (2008-06-23)
> i486-pc-linux-gnu
>
> locale:
> LC_CTYPE=en_SG.UTF-8;LC_NUMERIC=C;LC_TIME=en_SG.UTF-8;LC_COLLATE=en_SG.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_SG.UTF-8;LC_PAPER=en_SG.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_SG.UTF-8;LC_IDENTIFICATION=C
>
> attached base packages:
> [1] tools stats graphics grDevices utils datasets methods
> [8] base
>
> other attached packages:
> [1] hgu133plus2.db_2.2.0 illuminaHumanv2ProbeID.db_1.1.1
> [3] AnnotationDbi_1.2.2 RSQLite_0.6-9
> [5] DBI_0.2-4 Biobase_2.0.1
> [7] biomaRt_1.14.1 RCurl_0.9-4
>
> loaded via a namespace (and not attached):
> [1] XML_1.96-0
>