[BioC] Extracting protein sequence associated to UCSC transcript id

Wed Jan 8 21:54:00 CET 2014

Hi Raffaele,

I was about to post the same thing as Herve just did, but he not only 
beat me to the punch, but he also suggested using the names field for 
Biotstrings (which is an elegant suggestion).

But if in addition if you should want to get pre-packaged translations 
from somewhere else, you can get some of those via biomaRt too.  Some 
can be accessed using the UniProt.ws package.

Here is an example of how you can also get that data via the UniProt.ws 
package (which talks to the UniProt web services):

library(UniProt.ws)
seqs <- select(UniProt.ws, trs, "SEQUENCE", "UCSC")

Which solution you should choose may depend on your use case.  For 
example if you seek data where post-translational modifications are 
recorded, you would definitely not want to be translating them 
yourself.  But translating them yourself gives you the advantage of 
being able to control which genome you pull sequences from.

Hope this helps you,

  Marc

On 01/08/2014 11:59 AM, Hervé Pagès wrote:
> Hi Raf,
>
> On 01/08/2014 10:25 AM, rcaloger wrote:
>> I found a way to extract the protein sequences querying the UCSC web 
>> page.
>> However, there should be a  more elegant way to do it.
>> library(RCurl)
>> trs <- c("uc003mfv.3", "uc001ajb.1", "uc011asd.2")
>> myquery<- list()
>> for(i in 1:length(trs)){
>>      myquery[[i]] <-
>> getURL(paste("http://genome-euro.ucsc.edu/cgi-bin/hgGene?hgsid=195297095&hgg_do_getProteinSeq=1&hgg_gene=", 
>>
>> trs[i],sep=""))
>>      Sys.sleep(30)
>> }
>
> Your 3rd transcript is non-coding hence no protein sequence for it.
>
> Otherwise you get exactly what you wanted using Michael's suggestion:
>
>   library(TxDb.Hsapiens.UCSC.hg19.knownGene)
>   txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
>   cds_by_tx <- cdsBy(txdb, by="tx", use.names=TRUE)
>
> See which of your transcripts are coding:
>
>   > trs <- c("uc003mfv.3", "uc001ajb.1", "uc011asd.2")
>   > trs %in% names(cds_by_tx)
>   [1]  TRUE  TRUE FALSE
>
> Extract and translate the CDS sequences (note that here I choose to
> compute the full proteome but I don't have to):
>
>   library(BSgenome.Hsapiens.UCSC.hg19)
>   cds_seqs <- 
> extractTranscriptsFromGenome(BSgenome.Hsapiens.UCSC.hg19, cds_by_tx)
>   proteome <- translate(cds_seqs)
>   names(proteome) <- names(cds_seqs)
>
> Then:
>
>   > proteome
>     A AAStringSet instance of length 63691
>           width seq names
>       [1]   134 MSESINFSHNLGQLLSPPRCV...KGETQESVESRVLPGPRHRH* uc010nxq.1
>       [2]   306 MVTEFIFLGLSDSQELQTFLF...DMKTAIRQLRKWDAHSSVKF* uc001aal.1
>       [3]   390 MLLPPGSLSRPRTFSSQPLQT...CQQPQQAQLLPHSGPFRPNS* uc009vjk.2
>       [4]   390 MLLPPGSLSRPRTFSSQPLQT...CQQPQQAQLLPHSGPFRPNS* uc001aau.3
>       [5]   267 MAYLGPYPTSRQPPQMRLLPH...CQQPQQAQLLPHSGPFRPNS* uc021oeh.1
>       ...   ... ...
>   [63687]   425 MALPTPSDSTLPAEARGRGRR...AASLEAPLSEEEYRALLEEL* uc031tgq.1
>   [63688]   425 MALPTPSDSTLPAEARGRGRR...AASLEAPLSEEEYRALLEEL* uc031tgr.1
>   [63689]   425 MALPTPSDSTLPAEARGRGRR...AASLEAPLSEEEYRALLEEL* uc031tgs.1
>   [63690]   279 MGKGNEDSDLHCSSIQCSTDQ...HPFPGQEITETVSGSDEAKL* uc011mgh.2
>   [63691]   280 MGKGNEDSDLHCSSIQCSTDQ...HPFPGQEITETVSGSDEAKL* uc011mgi.2
>
>   > trs <- c("uc003mfv.3", "uc001ajb.1")
>
>   > proteome[trs]
>     A AAStringSet instance of length 2
>       width seq names
>   [1]   204 MSGQRVDVKVVMLGKEYVGKTSL...TEDKGVDLGQKPNPYFYSCCHH* uc003mfv.3
>   [2]   498 MAAAGEGTPSSRGPRRDPPRRPP...GPEASSSWQAAHSCTPEPPAPR* uc001ajb.1
>
> You can drop the trailing "*" with
>
>   subseq(proteome[trs], end=-2)
>
> Cheers,
> H.
>
>>
>> It is interesting that in bioconductor there are no databases linking
>> transcripts to proteins
>> Cheers
>> Raf
>>
>> On 08/01/14 17:10, Michael Lawrence wrote:
>>> In theory, you should be able to get the cds regions using e.g. the
>>> Homo.sapiens package, but it seems kind of tough to retrieve those for
>>> UCSC
>>> Known Gene identifiers (assuming that is what you have). Marc Carlson
>>> could
>>> probably help more.
>>>
>>> Michael
>>>
>>>
>>>
>>>
>>>
>>> On Wed, Jan 8, 2014 at 6:31 AM, rcaloger <raffaele.calogero at unito.it>
>>> wrote:
>>>
>>>> Dear Michael,
>>>> thank for the kind suggestion but unfortunately it does not solve my
>>>> problem because, using the approach you are suggesting, I do not have
>>>> access to the position of the start codon for the different isoforms.
>>>> Cheers
>>>> Raf
>>>>
>>>>
>>>> On 07/01/14 16:44, Michael Lawrence wrote:
>>>>
>>>>> If you had the transcript coordinates (as GRangesList, perhaps 
>>>>> from an
>>>>> exonsBy() on a TranscriptDb) you could use
>>>>> extractTranscriptsFromGenome()
>>>>> and translate, see the GenomicFeatures vignette for an example.
>>>>>
>>>>> Michael
>>>>>
>>>>>
>>>>> On Tue, Jan 7, 2014 at 6:54 AM, rcaloger 
>>>>> <raffaele.calogero at gmail.com>
>>>>> wrote:
>>>>>
>>>>>   Hi,
>>>>>> In order to validate fusion products I need to be sure that the
>>>>>> peptides
>>>>>> encoded by the the two fused proteins are in the same frame.
>>>>>> I have now a function that allows to confirm the protein1 and 
>>>>>> protein2
>>>>>> have sequences located in the same frame.
>>>>>> However, I got stack to retrieve those proteins sequences from 
>>>>>> UCSC. I
>>>>>> did
>>>>>> not found a quick way to retrieve the protein sequence associated 
>>>>>> to a
>>>>>> UCSC
>>>>>> ID.
>>>>>> Indeed the protein sequence is present in the UCSC genome browser,
>>>>>> but I
>>>>>> do not know how to grab it.
>>>>>> Any suggestion?
>>>>>> Cheers
>>>>>> Raffaele
>>>>>>
>>>>>> -- 
>>>>>>
>>>>>> ----------------------------------------
>>>>>> Prof. Raffaele A. Calogero
>>>>>> Bioinformatics and Genomics Unit
>>>>>> MBC Centro di Biotecnologie Molecolari
>>>>>> Via Nizza 52, Torino 10126
>>>>>> tel.   ++39 0116706457
>>>>>> Fax    ++39 0112366457
>>>>>> Mobile ++39 3333827080
>>>>>> email: raffaele.calogero at unito.it
>>>>>>          raffaele[dot]calogero[at]gmail[dot]com
>>>>>> www:   http://www.bioinformatica.unito.it
>>>>>>
>>>>>> _______________________________________________
>>>>>> Bioconductor mailing list
>>>>>> Bioconductor at r-project.org
>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>> Search the archives: http://news.gmane.org/gmane.
>>>>>> science.biology.informatics.conductor
>>>>>>
>>>>>>
>>>> -- 
>>>>
>>>> ----------------------------------------
>>>> Prof. Raffaele A. Calogero
>>>> Bioinformatics and Genomics Unit
>>>> MBC Centro di Biotecnologie Molecolari
>>>> Via Nizza 52, Torino 10126
>>>> tel.   ++39 0116706457
>>>> Fax    ++39 0112366457
>>>> Mobile ++39 3333827080
>>>> email: raffaele.calogero at unito.it
>>>>         raffaele[dot]calogero[at]gmail[dot]com
>>>> www:   http://www.bioinformatica.unito.it
>>>>
>>>>
>>
>>
>