[BioC] biomart to a data.frame

Martin Morgan mtmorgan at fhcrc.org
Thu Jan 26 11:14:37 CET 2012


On 01/26/2012 01:24 AM, Hans-Rudolf Hotz wrote:
>
>
> On 01/26/2012 08:28 AM, Assa Yeroslaviz wrote:
>> Hi Steve,
>>
>> thanks for the help.
>>
>> I know about the strsplit function and I used it to split each row on its
>> own by the ';' symbol.
>> The problem I have is that I need to keep the information of each row in
>> the row ( or at least to give it back after the biomaRt extraction).
>>
>> The table I have contains not only the protein IDs but also a lot of other
>> stuff, which is connected to each of the proteins. This is why I need to
>> know which proteins came from which line (Id).
>>
>> It would be nice if there was a possibility to do it as you suggested: take
>> all the protein IDs, write them into one vector and run them with biomaRt.
>> But then I would like to be able to put them back together in a row-wise
>> fashion like I suggested at the beginning.
>>
>
> Hi
>
> Please allow me to jump in:
>
> If I understand your question correctly, then there is no other (easy)
> solution than querying biomart inside a loop.
>
> The problem is not the Bioconductor package biomaRt, but the actual
> biomart server behind the scenes: apparently there is no way to preserve
> the order of the input (or keep duplicates, or indicate which id does
> not have a result, etc).

If the original ids are in a data.frame

   df <- data.frame(FBpp=c("FBpp0070037", "FBpp0070039;FBpp0070040",
                      "FBpp0070041;FBpp0070042;FBpp0070043",
                      "FBpp0070044;FBpp0110571"),
                    stringsAsFactors=FALSE)

and the 'split' ids are

   ids <- strsplit(df$FBpp, ";")

then 'map' relates the ids to the row they come from:

   map <- rep(seq_len(nrow(df)), sapply(ids, length))
   names(map) <- unlist(ids)

so after querying biomaRt

   library(biomaRt)
   mart <- useMart('ensembl', dataset='dmelanogaster_gene_ensembl')
   ans <- getBM(attributes=c("flybase_translation_id", "flybase_gene_id",
                  "flybasename_gene"),
                filters="flybase_translation_id",
                values=names(map), mart=mart)

and writing a little helper function to 'unsplit' a character vector x 
into 'collapsed' strings based on a factor f

   strunsplit <- function(x, f, collapse=";")
   {
       sapply(split(x, f), paste, collapse=collapse)
   }

the original data.frame can be updated as

   FBgn <-
       strunsplit(ans$flybase_gene_id, map[ans$flybase_translation_id])
   df$FBgn <- NA_character_   # rows without any hit stay NA
   df$FBgn[as.integer(names(FBgn))] <- FBgn
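
The steps above can be tried without a network connection by standing
in a small hand-made table for the getBM() result. This is only a
sketch: 'mock_ans' is a hypothetical stand-in, its rows are shuffled on
purpose, and leaving out the two ids from the last row of df is an
assumption made for illustration, not real server output. The gene ids
are the ones quoted from Steve's reply further down. The match() step
is an addition that puts the result back into input order before
collapsing.

```r
# Input table: several protein ids per row, separated by ';'
df <- data.frame(FBpp=c("FBpp0070037", "FBpp0070039;FBpp0070040",
                        "FBpp0070041;FBpp0070042;FBpp0070043",
                        "FBpp0070044;FBpp0110571"),
                 stringsAsFactors=FALSE)
ids <- strsplit(df$FBpp, ";")

# map: row number of df, named by protein id
map <- rep(seq_len(nrow(df)), lengths(ids))
names(map) <- unlist(ids)

# Hypothetical stand-in for the getBM() result: shuffled order, and no
# hits for the two ids in row 4 (an assumption, not real server output)
mock_ans <- data.frame(
    flybase_translation_id=c("FBpp0070042", "FBpp0070037", "FBpp0070040",
                             "FBpp0070041", "FBpp0070039", "FBpp0070043"),
    flybase_gene_id=c("FBgn0000258", "FBgn0010215", "FBgn0052230",
                      "FBgn0000258", "FBgn0052230", "FBgn0000258"),
    stringsAsFactors=FALSE)

# Put the result back into input order; ids without a hit are dropped
idx <- match(names(map), mock_ans$flybase_translation_id)
ans <- mock_ans[idx[!is.na(idx)], ]

# Collapse x into one string per level of f
strunsplit <- function(x, f, collapse=";")
    sapply(split(x, f), paste, collapse=collapse)

FBgn <- strunsplit(ans$flybase_gene_id, map[ans$flybase_translation_id])

df$FBgn <- NA_character_                  # rows without any hit stay NA
df$FBgn[as.integer(names(FBgn))] <- FBgn
df
```

One caveat: if the same protein id occurs in more than one row of df,
names(map) is no longer unique and map[...] only finds the first of
those rows.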

I guess the contortions occur because of the original data.frame. A 
different representation with the same information, assuming 'Id' is the 
criterion for joining the FBpp ids in the first place, is

   > df1 <- data.frame(Id=map, FBpp=names(map), row.names=NULL)
   > df1
     Id        FBpp
   1  1 FBpp0070037
   2  2 FBpp0070039
   3  2 FBpp0070040
   ...
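
With this long format, putting the answer back together could be a
join followed by a per-Id paste. A minimal sketch using base R merge()
and tapply(): 'mock_ans' is again a hypothetical stand-in for the
getBM() result, 'collapse_hits' is a helper name made up here, and
all.x=TRUE is a choice made for this sketch so that ids without a hit
are kept as NA rather than dropped.

```r
# Long format: one row per (Id, protein id) pair
df1 <- data.frame(Id=c(1L, 2L, 2L, 4L, 4L),
                  FBpp=c("FBpp0070037", "FBpp0070039", "FBpp0070040",
                         "FBpp0070044", "FBpp0110571"),
                  stringsAsFactors=FALSE)

# Hypothetical stand-in for the getBM() result (assumed: no hits for Id 4)
mock_ans <- data.frame(
    flybase_translation_id=c("FBpp0070040", "FBpp0070037", "FBpp0070039"),
    flybase_gene_id=c("FBgn0052230", "FBgn0010215", "FBgn0052230"),
    stringsAsFactors=FALSE)

# Left join: every row of df1 is kept, ids without a hit get NA
long <- merge(df1, mock_ans, by.x="FBpp", by.y="flybase_translation_id",
              all.x=TRUE)
long <- long[order(long$Id, long$FBpp), ]

# One collapsed string per Id; NA if none of its ids had a hit
collapse_hits <- function(x) {
    x <- x[!is.na(x)]
    if (length(x)) paste(x, collapse=";") else NA_character_
}
FBgn <- tapply(long$flybase_gene_id, long$Id, collapse_hits)
FBgn
```

Because the join works on rows rather than on names, a protein id that
appears under several Ids is handled without extra bookkeeping.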

Martin

>
> I recently asked the biomart folks about this issue, and the answer was
> that I need to post-process the output to get my original order back. I
> was lazy and queried the server in a loop (in my defense: it was
> only a handful of ids).
>
>
> Regards, Hans
>
>> Thanks again
>> Assa
>>
>> On Wed, Jan 25, 2012 at 16:02, Steve Lianoglou<
>> mailinglist.honeypot at gmail.com> wrote:
>>
>>> Hi Assa,
>>>
>>> Sorry for top posting.
>>>
>>> Your intuition is correct: you should not be querying biomart
>>> inside a for loop. The idea is to create one query for all of your
>>> protein IDs, and query it once.
>>>
>>> This is how you might go about it. First, let's look at the protein
>>> IDs you already seem to have somewhere:
>>>
>>>> 45 FBpp0070037
>>>> 46 FBpp0070039;FBpp0070040
>>>> 47 FBpp0070041;FBpp0070042;FBpp0070043
>>>> 48 FBpp0070044;FBpp0110571
>>>
>>> It seems you have multiple IDs jammed into one column of a data.frame
>>> maybe? The rows which have more than one ID, (eg.
>>> "FBpp0070039;FBpp0070040") will have to be split up so that each row
>>> (or element in a vector) only has one ID. Look into using `strsplit`.
>>>
>>> You will need to get a character vector of protein ids -- one protein
>>> per bin, it might look like so:
>>>
>>> pids<- c('FBpp0070037', 'FBpp0070039', 'FBpp0070040', 'FBpp0070041',
>>> 'FBpp0070042', 'FBpp0070043')
>>>
>>> Now ... you're basically done. Let's rig up an object to query biomart
>>> with:
>>>
>>> library(biomaRt)
>>> mart<- useMart('ensembl', dataset='dmelanogaster_gene_ensembl')
>>> ans<- getBM(attributes=c("flybase_translation_id", "flybase_gene_id",
>>>                          "flybasename_gene"),
>>>             filters="flybase_translation_id", values=pids,
>>>             mart=mart)
>>>
>>> Your answer will look like so:
>>>
>>>   flybase_translation_id flybase_gene_id flybasename_gene
>>> 1            FBpp0070037     FBgn0010215        alpha-Cat
>>> 2            FBpp0070039     FBgn0052230          CG32230
>>> 3            FBpp0070040     FBgn0052230          CG32230
>>> 4            FBpp0070041     FBgn0000258        CkIIalpha
>>> 5            FBpp0070042     FBgn0000258        CkIIalpha
>>> 6            FBpp0070043     FBgn0000258        CkIIalpha
>>>
>>> Now you're left with figuring out what to do with multiple
>>> "flybase_translation_id"s that map to the same "flybasename_gene".
>>>
>>> You would have to do this anyway, but the key point here is that you
>>> can now do it without querying biomart in a loop.
>>>
>>> HTH,
>>> -steve
>>>
>>>
>>>
>>>> For each of these protein Ids (FBpp...), I would like to extract the gene
>>>> id (FBgn....) in a third column. The output table should look like this:
>>>>
>>>> 45 FBpp0070037 FBgn001234
>>>> 46 FBpp0070039;FBpp0070040 FBgn00094432;FBgn002345
>>>> 47 FBpp0070041;FBpp0070042;FBpp0070043 FBgn0001936;FBgn000102;FBgn004527
>>>> 48 FBpp0070044;FBpp0110571 FBgn0097234;FBgn00183
>>>> ...
>>>>
>>>> I was thinking of using biomaRt, but I could not find a way of automating
>>>> it for the complete protein ids in the line.
>>>>
>>>> What I have done so far is this for loop:
>>>>
>>>> for(i in 1:dim(data)[1]){
>>>> temp=unlist(strsplit(data[i,2],";"))
>>>> temp= gsub("REV__", "", temp)
>>>> result=
>>>> getBM(attributes=c("flybase_translation_id","flybase_gene_id","flybasename_gene"),
>>>> filters="flybase_translation_id",values=temp, mart=mart)
>>>> charresult =""
>>>> for (j in 1:length(result[[1]])) {
>>>> # charresult<-paste(charresult,">", result[[1]][j],":",result[[2]][j], "\t", sep="")
>>>> charresult<-paste(charresult, result[[2]][j], ";", sep="")
>>>> }
>>>> out<-"CompleteResults.txt"
>>>> cat("line: ", i-1,"\t", "was written to ");cat(out);cat("\n")
>>>> write.table(paste(i-1, charresult, sep="\t"),out, sep="\t", quote=F,
>>>> col.names=F, row.names=F,append=T)
>>>> }
>>>>
>>>> What I am doing is converting the string of FBpp Ids into a character
>>>> vector and then running each line through the getBM command. First, I think
>>>> it is a bad idea, as I am using a loop to query an online database, but I
>>>> don't have a better option at the moment.
>>>>
>>>> The second problem is that it just takes a lot of time.
>>>>
>>>> I would appreciate your ideas, if there is a better/faster way of doing it.
>>>>
>>>> Thanks A.
>>>>
>>>> [[alternative HTML version deleted]]
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>>
>>>
>>> --
>>> Steve Lianoglou
>>> Graduate Student: Computational Systems Biology
>>> | Memorial Sloan-Kettering Cancer Center
>>> | Weill Medical College of Cornell University
>>> Contact Info: http://cbio.mskcc.org/~lianos/contact
>>>
>>
>


-- 
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793


