[BioC] biomaRt- incorrect number of transcripts

Wed Nov 11 12:06:41 CET 2009

Dear Robert,
I have looked into this query and it seems that you did not retrieve
unique results from the Biomart interface. I ran your query with the
webExample.pl script provided in the biomart-perl directory, once with
uniqueRows selected and once without. Without uniqueRows I get ~48000
rows, and with uniqueRows I get ~36000 rows. I have included the XML
for the query I ran with uniqueRows selected (uniqueRows = "1").

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Query>
<Query  virtualSchemaName = "default" formatter = "TSV" header = "0"  
uniqueRows = "1" count = "" datasetConfigVersion = "0.7" >

   <Dataset name = "mmusculus_gene_ensembl" interface = "default" >
     <Filter name = "chromosome_name" value =  
"1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,X,Y,MT"/>
     <Attribute name = "ensembl_transcript_id" />
     <Attribute name = "chromosome_name" />
     <Attribute name = "strand" />
     <Attribute name = "transcript_start" />
     <Attribute name = "transcript_end" />
   </Dataset>
</Query>
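
If you want to confirm this on your side, a minimal sketch in R of how
the check could look is given here. The small data frame is made up for
illustration (the transcript IDs and the Ensembl.Transcript.ID column
name are assumptions); with your data you would use the ens.web object
you read from mart_export.txt instead.

```r
# Hypothetical stand-in for a web export that contains repeated rows;
# with real data this would be the data frame read from mart_export.txt.
ens.web <- data.frame(
  Ensembl.Transcript.ID = c("T1", "T2", "T2", "T3", "T3", "T3"),
  Chromosome.Name       = c("2", "3", "3", "4", "4", "4"),
  stringsAsFactors      = FALSE
)

# Count rows that are exact repeats of an earlier row.
n.dup <- sum(duplicated(ens.web))

# Drop the repeats locally; this should leave the same rows that a
# query with uniqueRows = "1" would be expected to return.
ens.unique <- unique(ens.web)

# Per-chromosome counts before and after removing the repeats.
before <- table(ens.web$Chromosome.Name)
after  <- table(ens.unique$Chromosome.Name)
```

If n.dup is greater than zero and the counts in "after" match what you
get from getBM, then duplicated rows in the export would account for
the difference.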

I hope this resolves the issue for you but please do not hesitate to  
contact me if you need further clarification.
Kind regards
Rhoda

On 10 Nov 2009, at 19:25, Steffen at stat.berkeley.edu wrote:

> Dear Robert,
>
> Would it be possible to check if there are duplicates in the result you
> obtain via the web?  By default biomaRt retrieves only unique results,
> whereas results queried over the web are sometimes duplicated.  To
> remove these you need to check the unique only checkbox when exporting
> your results to a file.  Can you let me know if that explains the
> difference in the number of transcripts you notice?
>
> Cheers,
> Steffen
>
>> Dear mailing list,
>>
>> I have recently observed a discrepancy in genome annotation obtained
>> via the R package biomaRt.
>> I wanted to download all Ensembl transcripts from the entire mouse
>> genome (chromosomes 1:19, X, Y and MT only).
>>
>> When I set the filter based on chromosome names I retrieved ~36000
>> transcripts, please see the code below.
>> However, by using the web service www.biomart.org I received ~48000
>> transcripts for the same genome version and chromosomes.
>>
>> By comparing these two data frames you can see that the discrepancies
>> in the number of transcripts occur only for some chromosomes (3:9 and X).
>> If I specified only two chromosome names (2 and 3) then the number of
>> downloaded transcripts is correct for both of them.
>> If I did not set any filter in the getBM function and did the filtering
>> manually in R, the number of transcripts is correct.
>>
>> Session info is attached.
>>
>> Best Regards
>> Robert
>>
>> --
>> Robert Ivanek
>> Postdoctoral Fellow Schuebeler Group
>> Friedrich Miescher Institute
>> Maulbeerstrasse 66
>> 4058 Basel / Switzerland
>> Office phone: +41 61 697 6100
>>
>>
>> R> library("biomaRt")
>> R> ensembl <- useMart("ensembl", dataset = "mmusculus_gene_ensembl")
>> R> chroms <- c(1:19,"X","Y","MT")
>> R> table(getBM(attributes = c("ensembl_transcript_id",
>> "chromosome_name", "strand", "transcript_start", "transcript_end"),
>> filters = "chromosome_name", values = chroms, mart =
>> ensembl)$chromosome_name)
>>
>>    1   10   11   12   13   14   15   16   17   18   19    2    3    4
>> 2507 1869 4364 1501 1630 1624 1404 1522 1865  985 1245 5232 1080 1454
>>    5    6    7    8    9   MT    X    Y
>>  845 1209 1487 1129 1031   41 2072   17
>>
>> R> ens.web <- read.delim("../../../mart_export.txt",
>>                          stringsAsFactors = FALSE)
>> R> ens.web <- ens.web[ens.web$Chromosome.Name %in% chroms,]
>> R> table(ens.web$Chromosome.Name)
>>
>>    1   10   11   12   13   14   15   16   17   18   19    2    3    4
>> 2507 1869 4364 1501 1630 1624 1404 1522 1865  985 1245 5232 2179 3997
>>    5    6    7    8    9   MT    X    Y
>> 2822 2524 3919 2021 2163   41 3297   17
>>
>> R> table(getBM(attributes = c("ensembl_transcript_id",
>> "chromosome_name", "strand", "transcript_start", "transcript_end"),
>> filters = "chromosome_name", values = c("2","3","MT"), mart =
>> ensembl)$chromosome_name)
>>
>>   2    3   MT
>> 5232 2179   41
>>
>>
>> R> ens.r <- getBM(attributes = c("ensembl_transcript_id",
>> "chromosome_name", "strand", "transcript_start", "transcript_end"),
>> mart = ensembl)
>> R> ens.r <- ens.r[ens.r$chromosome_name %in% chroms,]
>> R> table(ens.r$chromosome_name)
>>
>>    1   10   11   12   13   14   15   16   17   18   19    2    3    4
>> 2507 1869 4364 1501 1630 1624 1404 1522 1865  985 1245 5232 2179 3997
>>    5    6    7    8    9   MT    X    Y
>> 2822 2524 3919 2021 2163   41 3297   17
>>
>>
>>
>> R> sessionInfo()
>> R version 2.10.0 (2009-10-26)
>> x86_64-unknown-linux-gnu
>>
>> locale:
>> [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>> LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=C
>>
>> [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C
>> LC_ADDRESS=C               LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>
>>
>> other attached packages:
>> [1] biomaRt_2.2.0
>>
>> loaded via a namespace (and not attached):
>> [1] RCurl_1.3-0  tools_2.10.0 XML_2.6-0
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>

Rhoda Kinsella Ph.D.
Ensembl Bioinformatician,
European Bioinformatics Institute (EMBL-EBI),
Wellcome Trust Genome Campus,
Hinxton
Cambridge CB10 1SD,
UK.