[BioC] biomaRt- incorrect number of transcripts
Rhoda Kinsella
rhoda at ebi.ac.uk
Wed Nov 11 12:06:41 CET 2009
Dear Robert,
I have looked into this query and it seems that you did not retrieve
unique results from the BioMart interface. I ran your query with the
webExample.pl script provided in the biomart-perl directory, once with
uniqueRows selected and once without. Without uniqueRows I get ~48000
rows; with uniqueRows I get ~36000 rows. Below is the XML for the query
I ran with uniqueRows selected (uniqueRows = "1").
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Query>
<Query virtualSchemaName = "default" formatter = "TSV" header = "0"
       uniqueRows = "1" count = "" datasetConfigVersion = "0.7" >
  <Dataset name = "mmusculus_gene_ensembl" interface = "default" >
    <Filter name = "chromosome_name"
            value = "1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,X,Y,MT"/>
    <Attribute name = "ensembl_transcript_id" />
    <Attribute name = "chromosome_name" />
    <Attribute name = "strand" />
    <Attribute name = "transcript_start" />
    <Attribute name = "transcript_end" />
  </Dataset>
</Query>
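As a rough way to check whether duplicate rows account for a gap like
this, the sketch below counts exact repeats in a data frame and compares
per-chromosome counts before and after de-duplication. The data frame
ens.toy and its IDs and coordinates are made up for illustration; on the
real web export the same calls (duplicated, unique, table) would be
applied to the data frame read from the export file instead.

```r
# Toy stand-in for a BioMart export; IDs and coordinates are invented.
ens.toy <- data.frame(
  ensembl_transcript_id = c("T1", "T2", "T2", "T3", "T3", "T3"),
  chromosome_name       = c("2", "3", "3", "X", "X", "X"),
  strand                = c(1, -1, -1, 1, 1, 1),
  transcript_start      = c(100, 200, 200, 300, 300, 300),
  transcript_end        = c(150, 260, 260, 390, 390, 390),
  stringsAsFactors      = FALSE
)

# Number of rows that are exact repeats of an earlier row
n.dup <- sum(duplicated(ens.toy))

# Keep one copy of each distinct row
ens.unique <- unique(ens.toy)

# Per-chromosome row counts before and after de-duplication
counts.raw    <- table(ens.toy$chromosome_name)
counts.unique <- table(ens.unique$chromosome_name)

print(n.dup)
print(counts.raw)
print(counts.unique)
```

If the counts only differ on the chromosomes where the web and biomaRt
results disagree, duplicated rows are a likely explanation.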
I hope this resolves the issue for you but please do not hesitate to
contact me if you need further clarification.
Kind regards
Rhoda
On 10 Nov 2009, at 19:25, Steffen at stat.berkeley.edu wrote:
> Dear Robert,
>
> Would it be possible to check if there are duplicates in the result you
> obtain via the web? By default biomaRt retrieves only unique results,
> but results queried over the web are sometimes duplicated. To remove
> these, you need to check the "unique only" checkbox when exporting your
> results to a file. Can you let me know if that explains the difference
> in the number of transcripts you notice?
>
> Cheers,
> Steffen
>
>> Dear mailing list,
>>
>> I have recently observed a discrepancy in genome annotation obtained
>> via the R package biomaRt.
>> I wanted to download all Ensembl transcripts from the entire mouse
>> genome (chromosomes 1:19, X, Y and MT only).
>>
>> When I set the filter based on chromosome names I retrieved ~36000
>> transcripts; please see the code below.
>> However, by using the web service www.biomart.org I received ~48000
>> transcripts for the same genome version and chromosomes.
>>
>> Comparing these two data frames shows that the discrepancies in the
>> number of transcripts occur only for some chromosomes (3:9 and X).
>> If I specified only two chromosome names (2 and 3), then the number of
>> downloaded transcripts was correct for both of them.
>> If I did not set any filter in the getBM function and did the filtering
>> manually in R, the number of transcripts was also correct.
>>
>> Session info is attached.
>>
>> Best Regards
>> Robert
>>
>> --
>> Robert Ivanek
>> Postdoctoral Fellow Schuebeler Group
>> Friedrich Miescher Institute
>> Maulbeerstrasse 66
>> 4058 Basel / Switzerland
>> Office phone: +41 61 697 6100
>>
>>
>> R> library("biomaRt")
>> R> ensembl <- useMart("ensembl", dataset = "mmusculus_gene_ensembl")
>> R> chroms <- c(1:19,"X","Y","MT")
>> R> table(getBM(attributes = c("ensembl_transcript_id", "chromosome_name",
>>                               "strand", "transcript_start", "transcript_end"),
>>              filters = "chromosome_name", values = chroms,
>>              mart = ensembl)$chromosome_name)
>>
>>    1   10   11   12   13   14   15   16   17   18   19    2    3    4
>> 2507 1869 4364 1501 1630 1624 1404 1522 1865  985 1245 5232 1080 1454
>>    5    6    7    8    9   MT    X    Y
>>  845 1209 1487 1129 1031   41 2072   17
>>
>> R> ens.web <- read.delim("../../../mart_export.txt", stringsAsFactors = FALSE)
>> R> ens.web <- ens.web[ens.web$Chromosome.Name %in% chroms,]
>> R> table(ens.web$Chromosome.Name)
>>
>>    1   10   11   12   13   14   15   16   17   18   19    2    3    4
>> 2507 1869 4364 1501 1630 1624 1404 1522 1865  985 1245 5232 2179 3997
>>    5    6    7    8    9   MT    X    Y
>> 2822 2524 3919 2021 2163   41 3297   17
>>
>> R> table(getBM(attributes = c("ensembl_transcript_id", "chromosome_name",
>>                               "strand", "transcript_start", "transcript_end"),
>>              filters = "chromosome_name", values = c("2", "3", "MT"),
>>              mart = ensembl)$chromosome_name)
>>
>>    2    3   MT
>> 5232 2179   41
>>
>>
>> R> ens.r <- getBM(attributes = c("ensembl_transcript_id", "chromosome_name",
>>                                 "strand", "transcript_start", "transcript_end"),
>>                mart = ensembl)
>> R> ens.r <- ens.r[ens.r$chromosome_name %in% chroms,]
>> R> table(ens.r$chromosome_name)
>>
>>    1   10   11   12   13   14   15   16   17   18   19    2    3    4
>> 2507 1869 4364 1501 1630 1624 1404 1522 1865  985 1245 5232 2179 3997
>>    5    6    7    8    9   MT    X    Y
>> 2822 2524 3919 2021 2163   41 3297   17
>>
>>
>>
>> R> sessionInfo()
>> R version 2.10.0 (2009-10-26)
>> x86_64-unknown-linux-gnu
>>
>> locale:
>>  [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=C
>>  [6] LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats graphics grDevices utils datasets methods base
>>
>>
>> other attached packages:
>> [1] biomaRt_2.2.0
>>
>> loaded via a namespace (and not attached):
>> [1] RCurl_1.3-0 tools_2.10.0 XML_2.6-0
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
Rhoda Kinsella Ph.D.
Ensembl Bioinformatician,
European Bioinformatics Institute (EMBL-EBI),
Wellcome Trust Genome Campus,
Hinxton
Cambridge CB10 1SD,
UK.