[BioC] help with biomaRt bioconductor - Filter upstream_flank NOT FOUND problem

Tue Sep 25 14:45:01 CEST 2012

2012/8/9 Steffen Durinck <durinck.steffen at gene.com>:
> Thanks for the code example Wolfgang,
>
> The stochasticity suggests the problem is on the BioMart server side, I'll
> contact them to see if they can look into it.

Has anybody been able to fix the problem, or received a response from the helpdesk?

Best
Stefan

>
> On Tue, Aug 7, 2012 at 2:08 AM, Wolfgang Huber <whuber at embl.de> wrote:
>
>> Dear Steffen / List,
>> below is a more compact code example that reproduces Tom's problem. I am
>> rather confused by the fact that the problem seemed to occur stochastically!
>>
>> -------------------
>> library(biomaRt)
>> options(error=recover)
>> ensembl = useMart("ensembl")
>>
>> human = useDataset("hsapiens_gene_ensembl",mart=ensembl)
>> attr = c('ensembl_gene_id','ensembl_transcript_id',
>>        'external_gene_id','chromosome_name','strand',
>>        'transcript_start')
>> bmres = getBM(attr, 'biotype', values = 'protein_coding', human)
>>
>> for(id in bmres[,"ensembl_transcript_id"]){
>>  sequence = getSequence(id=id, type='ensembl_transcript_id',
>>                        seqType='transcript_flank', upstream = 3000,
>>                        mart = human)
>>  sl = with(sequence, nchar(as.character(transcript_flank)))
>>  cat(id, sl, "\n")
>> }
>> -------------------
>>
>> On running this once, I got
>> ...(lots of lines)
>> ENST00000520540 3000
>> ENST00000519310 3000
>> ENST00000442920 3000
>>
>> Error in getBM(c(seqType, type), filters = c(type, "upstream_flank"),  :
>>   Query ERROR: caught BioMart::Exception::Usage: Filter upstream_flank NOT FOUND
>>
>> The next time, the same error already occurred in the very first iteration
>> of the for-loop, for id="ENST00000539570". The next time, in the third
>> iteration for id="ENST00000510508".
>>
>> Any idea what is going on here?
>>
>>
>> Further comments:
>> - for *Steffen*: The documentation and the code of 'getSequence' do not
>> seem to match each other (e.g. the description of argument 'seqType'), and
>> MySQL mode is mentioned but, as far as I understand, is no longer
>> supported -> perhaps some maintenance would be helpful for users.
>> - for *Tom*: Making these queries (such as getSequence) within a for-loop
>> is bad practice, since it needlessly clogs the network and the BioMart
>> webservers. Please use R's vector-capabilities, e.g.
>>
>> ------------------------
>> sequence = getSequence(id=bmres[,"ensembl_transcript_id"],
>>   type='ensembl_transcript_id', seqType='transcript_flank',
>>   upstream = 3000, mart = human)
>> sl = with(sequence, nchar(as.character(transcript_flank)))
>> -------------------------
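
As a stopgap while the server-side problem is open, the vectorised query above could be split into moderately sized batches, with each batch retried when it fails. Below is a minimal sketch in base R. The helper `fetch_in_batches`, its arguments and the batch size are made up for illustration and are not part of biomaRt. The demo uses a fake fetch function that fails once, so it runs without network access; in real use one would pass a function wrapping getSequence(), as in the trailing comment.

```r
# Sketch only: split ids into consecutive batches and retry a batch on error.
# 'fetch' is any function taking a vector of ids and returning a data.frame.
fetch_in_batches <- function(ids, fetch, batch_size = 200, max_tries = 3, wait = 0) {
  # grouping vector 1,1,...,2,2,... yields chunks of at most batch_size ids
  batches <- split(ids, ceiling(seq_along(ids) / batch_size))
  results <- lapply(batches, function(batch) {
    for (attempt in seq_len(max_tries)) {
      res <- tryCatch(fetch(batch), error = function(e) e)
      if (!inherits(res, "error")) return(res)
      message("attempt ", attempt, " failed: ", conditionMessage(res))
      if (wait > 0) Sys.sleep(wait)
    }
    stop("batch failed after ", max_tries, " attempts")
  })
  do.call(rbind, unname(results))
}

# Offline demo: a stand-in for the server that fails on the first call only.
calls <- 0
flaky_fetch <- function(batch) {
  calls <<- calls + 1
  if (calls == 1) stop("Filter upstream_flank NOT FOUND")
  data.frame(id = batch, n = nchar(batch), stringsAsFactors = FALSE)
}
ids <- c("ENST00000520540", "ENST00000519310", "ENST00000442920")
out <- fetch_in_batches(ids, flaky_fetch, batch_size = 2)
print(out)

# With biomaRt (not run here; needs network access and the 'human' mart object):
# seqs <- fetch_in_batches(bmres[, "ensembl_transcript_id"], function(b)
#   getSequence(id = b, type = "ensembl_transcript_id",
#               seqType = "transcript_flank", upstream = 3000, mart = human))
```

Retrying only papers over a transient failure; it does not explain why the filter is intermittently reported as missing, and it still sends far fewer requests than one query per transcript.
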
>>
>> Best wishes
>>         Wolfgang
>>
>>
>> Tom Hait wrote on 08/06/2012 12:37 PM:
>>
>>  Hello,
>>>
>>> I'm a student in bioinformatics at Tel Aviv University.
>>> I'm working with your biomaRt API in order to automate the download of
>>> FASTA sequences.
>>> I ran into a problem; here is my code:
>>>
>>> #load the biomaRt library
>>> library(biomaRt)
>>> #connect to the Ensembl mart ('ensembl' is used below but was not defined)
>>> ensembl = useMart("ensembl")
>>> #open data set of human
>>> human = useDataset("hsapiens_gene_ensembl",mart=ensembl)
>>> #select the attributes that we want from the data set
>>> attr<-c('ensembl_gene_id','ensembl_transcript_id',
>>> 'external_gene_id','chromosome_name','strand','transcript_start')
>>> #downloading the map between transcript id and transcript name
>>> tmpgene<-getBM(attr, 'biotype', values = 'protein_coding', human)
>>> #save in a TSV format (the file is saved in txt)
>>> write.table(tmpgene,"Z:/tomhait/organisms/human/transcript_names.txt",
>>> row.names=FALSE, quote=FALSE, sep="\t")
>>> #collect all sequences with upstream flank 3000 bases based on the second
>>> #column (ensembl_transcript_id) of tmpgene
>>> i<-1
>>> for(id1 in tmpgene[,2]){
>>>   #retrieve sequence
>>>   sequence<-getSequence(id=id1, type='ensembl_transcript_id',
>>>     seqType='transcript_flank', upstream = 3000, mart = human)
>>>   #check if sequence was retrieved
>>>   sLengths <- with(sequence, nchar(as.character(transcript_flank)))
>>>
>>>   #writing to a new file in "Z:/tomhait/organisms/human/mart_export_new.txt"
>>>   #you can change it to "mart_export_new.txt" and it will create a new file
>>>   #in R's working directory
>>>   if(length(sLengths) > 0){
>>>    x<-sequence[,1]
>>>    #split the sequence into lines of 60 characters
>>>    y<-strsplit(gsub("([[:alnum:]]{60})", "\\1 ", x), " ")[[1]]
>>>    title<-paste(paste(">",tmpgene[i,1],sep=""),tmpgene[i,2],tmpgene[i,3],
>>>      tmpgene[i,4],tmpgene[i,5],tmpgene[i,6],sep="|")
>>>    write(title,file="Z:/tomhait/organisms/human/mart_export_new.txt",
>>>      ncolumns = 1, append=TRUE,sep="")
>>>    write(y,file="Z:/tomhait/organisms/human/mart_export_new.txt",
>>>      ncolumns = 1, append=TRUE,sep="\n")
>>>    write("\n",file="Z:/tomhait/organisms/human/mart_export_new.txt",
>>>      ncolumns = 1, append=TRUE,sep="\n")
>>>   }
>>>   i<-i+1
>>> }
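
The FASTA formatting step in the loop above (a header joined with "|", then the sequence wrapped at 60 characters) can be separated from the download, which makes it possible to test offline. A minimal sketch follows; `fasta_record` is a hypothetical helper and the field values are placeholders, not real Ensembl identifiers.

```r
# Sketch only: build the lines of one FASTA record.
# 'fields' become the "|"-separated header; 'sequence' is wrapped at 'width'.
fasta_record <- function(fields, sequence, width = 60) {
  header <- paste0(">", paste(fields, collapse = "|"))
  if (nchar(sequence) == 0) return(header)   # nothing to wrap
  starts <- seq(1, nchar(sequence), by = width)
  # substring() is vectorised over start/stop and truncates at the string end
  c(header, substring(sequence, starts, starts + width - 1))
}

# Offline demo with a made-up 160-base sequence.
fields <- c("ENSG_EXAMPLE", "ENST_EXAMPLE", "GENE1", "1", "1", "1000")
rec <- fasta_record(fields, paste(rep("ACGT", 40), collapse = ""))

# Write the record to a temporary file and read it back.
out_file <- tempfile(fileext = ".txt")
writeLines(rec, out_file)
print(readLines(out_file))
```

Collecting the records for all transcripts in one character vector and calling writeLines() once would also avoid reopening the output file three times per transcript.
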
>>>
>>> I got the message:
>>> Error in getBM(c(seqType, type), filters = c(type, "upstream_flank"),  :
>>>    Query ERROR: caught BioMart::Exception::Usage: Filter upstream_flank NOT FOUND
>>>
>>> Could you please help me to solve this problem?
>>>
>>> Best Regards,
>>>
>>> Tom Hait.
>>>
>>>         [[alternative HTML version deleted]]
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>>
>>
>> --
>> Best wishes
>>         Wolfgang
>>
>> Wolfgang Huber
>> EMBL
>> http://www.embl.de/research/units/genome_biology/huber
>>