[BioC] problem with biomaRt package using mart "snps", dataset "hsapiens_structvar", attribute "description"
Wolfgang Huber
whuber at embl.de
Fri Apr 1 00:52:09 CEST 2011
Dear Mick
thank you for the (almost - see below) reproducible report.
The bottom line is that R's read.table does not like newline (\n)
characters within quoted text ("): it interprets them as line ends, which
messes up the tab-delimited table that the BioMart query returns.
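To illustrate the failure mode, here is a minimal sketch. The sample text is a shortened stand-in for the server reply (its first record is made up), and the read.table arguments, in particular quote="", are an assumption for illustration and not necessarily what getBM uses internally.

```r
# Two records with four tab-separated columns each. The second record has
# newlines inside its last field, as seen in the 'description' column.
txt <- paste0(
  "100\t200\tesv_example\tplain description\n",   # made-up record
  "269735\t349386\tesv29987\tLevy 2007 \"The diploid genome sequence of an\n",
  "individual human.\n",
  "\" PMID:17803354\n"
)

# With quoting disabled (an assumption), every \n ends a line, so the
# continuation lines have too few fields and the parse fails.
res <- tryCatch(
  read.table(text = txt, sep = "\t", quote = "", header = FALSE,
             comment.char = "", stringsAsFactors = FALSE),
  error = function(e) e
)
print(inherits(res, "error"))
if (inherits(res, "error")) print(conditionMessage(res))
```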
I suggest either of two possible solutions:
- The BioMart dataset is modified to abstain from putting \n and other
funny characters within quoted text
- the biomaRt package is modified to tolerate such behaviour
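As a rough idea of what the second option could look like on the client side, here is a sketch of a pre-processing step. The helper repair_records is hypothetical and not part of biomaRt, and the first sample record is made up. It assumes that a genuine record start is any line with exactly n_fields - 1 tabs and that the stray newlines occur only in the last column; it will mis-join records if either assumption fails.

```r
# Hypothetical helper: glue continuation lines back onto the record
# they belong to before handing the text to read.table.
repair_records <- function(txt, n_fields) {
  lines <- strsplit(txt, "\n", fixed = TRUE)[[1]]
  # number of tab characters on each line
  n_tabs <- lengths(regmatches(lines, gregexpr("\t", lines, fixed = TRUE)))
  # a line with the full number of separators starts a new record
  starts <- n_tabs == n_fields - 1
  grp <- cumsum(starts)
  # join each record's pieces with a space instead of a newline
  unname(vapply(split(lines, grp), paste, character(1), collapse = " "))
}

txt <- paste0(
  "100\t200\tesv_example\tplain description\n",   # made-up record
  "269735\t349386\tesv29987\tLevy 2007 \"The diploid genome sequence of an\n",
  "individual human.\n",
  "\" PMID:17803354\n"
)

fixed <- repair_records(txt, n_fields = 4)
df <- read.table(text = paste(fixed, collapse = "\n"), sep = "\t",
                 quote = "", header = FALSE, comment.char = "",
                 stringsAsFactors = FALSE)
print(dim(df))
print(df$V4)
```

Joining on a space keeps the description readable; a stricter client might prefer to stop with an informative error instead of guessing.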
I am not sure how it would be possible to make the communication between
BioMart servers and their clients such as biomaRt more robust. Is there a
clear specification of BioMart servers' tab-delimited format and what
the legal characters are? This would certainly be helpful for people who
program clients.
I compacted your example into the following.
library("biomaRt")
options(error=recover)
ensembl.var <- useMart("snp")
sv <- useDataset("hsapiens_structvar", mart=ensembl.var)
x2 <- getBM(c("chrom_start", "chrom_end",
              "structural_variation_name", "description"),
            filters=c("chr_name"), values=list(6), mart=sv)
This generates the "error in scan(file, what, nmax, sep, dec, quote,
skip, nlines, na.strings, : line 135 did not have 4 elements". You then
get a menu from R's debugger. Enter "4" to get into the local evaluation
environment of the getBM function just before the error is thrown. Then,
type
cat(postRes, file="postRes.txt")
and open the file in a text editor, e.g. emacs. Lines 133-135 are:
269735 349386 esv29987 Levy 2007 "The diploid genome sequence of an
individual human.
" PMID:17803354 [remapped from build NCBI36]
Note that there are two newlines (\n) within the title of the paper,
which probably shouldn't be there. The same is also true at many other
places in the file, wherever the Levy paper is referred to.
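To list all such places without scrolling through the file, one can count the tab-separated fields per line. The sketch below writes a small stand-in file (the first record is made up) instead of using the real postRes.txt; the expected field count of 4 corresponds to the four attributes requested above.

```r
# Small stand-in for postRes.txt; replace 'path' with "postRes.txt"
# to inspect the real query result.
path <- tempfile(fileext = ".txt")
writeLines(c(
  "100\t200\tesv_example\tplain description",   # made-up record
  "269735\t349386\tesv29987\tLevy 2007 \"The diploid genome sequence of an",
  "individual human.",
  "\" PMID:17803354 [remapped from build NCBI36]"
), path)

# Count fields per line with quoting and comment handling switched off,
# so that each physical line is counted on its own.
n <- count.fields(path, sep = "\t", quote = "", comment.char = "")

# Lines that do not have the expected 4 fields. Note that count.fields
# skips blank lines by default, so the indices match readLines() only
# when the file contains no blank lines.
bad <- which(n != 4)
print(bad)
print(readLines(path)[bad])
```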
I leave it to Steffen to decide whether he wants to modify biomaRt; and
to you, whether you want to lobby the curators of that dataset to
make the 'description' field more consistent.
Hope this helps.
Wolfgang
PS: The line from your example code
useMart("snps")
resulted for me in an error message "Incorrect BioMart name, use the
listMarts function to see which BioMart databases are available". (There
is an extraneous "s"). Next time, please always send an exact transcript
of what you do, to make sure the problem is not due to a typing error.
Second, and more to the point of your question, t
On Mar/31/11 5:25 PM, mmaguire wrote:
> To whom it may concern,
> I work in the DGVa group at EBI, this group works on structural variants. I ran into a problem using the R package biomaRt when attempting to retrieve information from the "snps" mart "hsapiens_structvar" dataset,
> here is my code with comments:
>
> Here is the R code that I've written:
>
> # Testing retrieval of SVs from Biomart
>
> library(biomaRt)
>
> # Select the version "ENSEMBL VARIATION 61 (SANGER UK)"
> ensembl.var<- useMart("snps")
>
> # Select SV dataset from the chosen mart
> sv<- useDataset("hsapiens_structvar", mart=ensembl.var)
>
> # Set attributes and filters for the chosen dataset and retrieve the data into a data frame
> chr6.svs<-getBM(c("chrom_start", "chrom_end", "structural_variation_name"), filters=c("chr_name"), values=list(6), mart=sv)
> # Check for returned data (brings back 65,532 rows for chromosome 6)
> summary(chr6.svs)
> # Write the data frame to a text file
> write.table( chr6.svs, file='chr6_svs_from_biomart.txt', sep="\t", quote=FALSE, append=FALSE, na="", row.names=FALSE )
>
>
> # Adding "description" to the vector of attributes in the above call to function "getBM()" causes the code to fail with the error given below.
> chr6.svs<- getBM(c("chrom_start", "chrom_end", "structural_variation_name", "description"), filters=c("chr_name"), values=list(6), mart=sv) # Does not work
> #Error returned by R when attempting to get the SV description attribute:
> # Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
> # line 135 did not have 4 elements
>
> The code fails when the SV "description" attribute is added. I think the problem arises due to the spaces in the "description" field with R incorrectly interpreting each space delimited word as vector element. My R is limited so I may be wrong. Anyway, I can run the same query from the web interface and correctly retrieve the "description" attribute.
> I've checked this with our Biomart person, Rhoda Kinsella, and the data in the Biomart looks correct and, as stated above, we can export it from the web interface.
> Any help gratefully received.
>
> Cheers
>
> Mick
>
>> Michael Maguire
>> Variation Archive Bioinformatician
>> European Bioinformatics Institute
>> Wellcome Trust Genome Campus
>> Hinxton
>> Cambridge CB10 1SD
>>
>> Phone +44 1223 494674
>> Email mmaguire at ebi.ac.uk
--
Wolfgang Huber
EMBL
http://www.embl.de/research/units/genome_biology/huber