[BioC] Annotating HGU133plus2 genes with number of coding changes

Thu Apr 19 15:46:51 CEST 2007

Hi Marco,

Ensembl maps everything to the transcript level, so when there are 
multiple transcripts for one gene, a query will return multiple hits for 
that gene.
To see this more clearly you could add "ensembl_transcript_id" to your query:

probe.list <- getBM(attributes=c("ensembl_gene_id","ensembl_transcript_id","affy_hg_u133_plus_2"),
                    filters="affy_hg_u133_plus_2", values=probes, mart=mart)

You'll see that each of the repeated rows corresponds to a different 
transcript, so at this level there is no redundancy.
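
A minimal sketch of why the gene-level result contains repeated rows, using a small hand-made data frame in place of a real getBM() result. The transcript IDs are placeholders, not real Ensembl identifiers, and collapsing with unique() is just one possible way to handle the repeats:

```r
# Toy stand-in for a getBM() result; transcript IDs are made-up placeholders.
res <- data.frame(
  ensembl_gene_id       = c("ENSG00000184895", "ENSG00000184895", "ENSG00000129824"),
  ensembl_transcript_id = c("TRANSCRIPT_A", "TRANSCRIPT_B", "TRANSCRIPT_C"),
  affy_hg_u133_plus_2   = c("207893_at", "207893_at", "201909_at"),
  stringsAsFactors = FALSE
)

# At the transcript level every row is distinct.
any(duplicated(res))

# Without the transcript column, rows for the same gene/probe pair repeat.
gene.level <- res[, c("ensembl_gene_id", "affy_hg_u133_plus_2")]
any(duplicated(gene.level))

# unique() keeps one row per gene/probe pair.
gene.level.unique <- unique(gene.level)
nrow(gene.level.unique)
```

If only the gene-to-probe mapping matters, the same unique() call could be applied to a real query result.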
The mapping to the transcript level is a choice of the Ensembl team and 
we cannot change this.  It makes sense for other annotation information 
such as protein domains: some alternatively spliced transcripts might have 
a certain domain while other transcripts of the same gene might not.  
Likewise, if you query for 3'UTRs, mapping to the transcript level lets 
you retrieve all the different UTRs associated with a gene.  Different 
transcripts of the same gene might even have different functions, and the 
current strategy would allow transcript-specific GO annotations.
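
As a rough illustration of working at the transcript level, the sketch below groups transcripts by gene with base R's split(). The data frame is a hand-made stand-in for a getBM() result, and the transcript IDs are placeholders rather than real Ensembl identifiers:

```r
# Toy stand-in for a getBM() result; transcript IDs are made-up placeholders.
res <- data.frame(
  ensembl_gene_id       = c("ENSG00000184895", "ENSG00000184895", "ENSG00000129824"),
  ensembl_transcript_id = c("TRANSCRIPT_A", "TRANSCRIPT_B", "TRANSCRIPT_C"),
  stringsAsFactors = FALSE
)

# One character vector of transcript IDs per gene.
transcripts.by.gene <- split(res$ensembl_transcript_id, res$ensembl_gene_id)

# Number of transcripts recorded for each gene.
n.transcripts <- sapply(transcripts.by.gene, length)
n.transcripts
```

Any transcript-level attribute returned by a query could be grouped per gene in the same way.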

Best regards,
Steffen

marco zucchelli wrote:
> Hi Steffen,
>
> One more question: in the example I reported before, it seems that some 
> probes are reported twice, e.g. 207893_at is listed two times, matched to 
> the same gene ID. In total, the "probes" vector contains the 54675 probes 
> from hgu133plus2, but the query returns 66565 rows.
>
> I do not really understand what this means.
>
> Regards
>
> Marco
>
> probe.list <- getBM(attributes=c("ensembl_gene_id","affy_hg_u133_plus_2"),
>                     filters="affy_hg_u133_plus_2", values=probes, mart=mart)
>
> head(probe.list)
>
>   ensembl_gene_id affy_hg_u133_plus_2
> 1 ENSG00000184895           207893_at
> 2 ENSG00000184895           207893_at
> 3 ENSG00000129824           201909_at
> 4 ENSG00000129824           201909_at
> 5 ENSG00000067646         207247_s_at
> 6 ENSG00000067646           207246_at
>
>
>
> On 4/3/07, Steffen Durinck <durincks at mail.nih.gov> wrote:
>
>     Hi Marco,
>
>     It matches the transcripts and then maps those transcripts to the
>     genes, even if you don't include the transcript id in the query.
>     To see this you could set attributes =
>     c("ensembl_gene_id","ensembl_transcript_id","affy_hg_u133_plus_2") in
>     your query.  Also, if Ensembl didn't find a match for the affy probe
>     then it won't be included in the output, and if they find multiple
>     matches then all of them will be returned.
>
>     For the second part of your question: no, the ordering is arbitrary
>     with respect to your input, so you'll have to reorder the output with
>     e.g. the match function, or loop over it.
>
>     Cheers,
>     Steffen
>
>     marco zucchelli wrote:
>     > Steffen,
>     >
>     > Does this procedure match the affy_ID to the specific
>     > transcript(s) that the probeset is targeting, or does it match it
>     > to a gene and then get all the available transcripts for the gene?
>     >
>     > Moreover, it seems that the values returned by getBM are not ordered
>     > like the input values.
>     > In fact, if I use:
>     >
>     > head(probes)
>     > [1] "AFFX-BioB-5_at"  "AFFX-BioB-M_at"  "AFFX-BioB-3_at"
>     > "AFFX-BioC-5_at"  "AFFX-BioC-3_at"  "AFFX-BioDn-5_at"
>     >
>     > probe.list <- getBM(attributes=c("ensembl_gene_id","affy_hg_u133_plus_2"),
>     >                     filters="affy_hg_u133_plus_2", values=probes, mart=mart)
>     >
>     > head(probe.list)
>     >
>     >   ensembl_gene_id affy_hg_u133_plus_2
>     > 1 ENSG00000184895           207893_at
>     > 2 ENSG00000184895           207893_at
>     > 3 ENSG00000129824           201909_at
>     > 4 ENSG00000129824           201909_at
>     > 5 ENSG00000067646         207247_s_at
>     > 6 ENSG00000067646           207246_at
>     >
>     > Is there any rule by which getBM orders the probes?
>     > Or am I doing something wrong?
>     >
>     >
>     > Marco
>     >
>     >
>     >
>     > On 3/30/07, Steffen Durinck <durincks at mail.nih.gov> wrote:
>     >
>     >     Hi Marco,
>     >
>     >     You can do this with the biomaRt package (use the devel version,
>     >     >= 1.9.21). Here's how:
>     >
>     >     library(biomaRt)
>     >     mart <- useMart("ensembl", dataset="hsapiens_gene_ensembl")
>     >     getBM(attributes=c("ensembl_gene_id","ensembl_transcript_id",
>     >                        "synonymous_snp_count","non_synonymous_snp_count"),
>     >           filters="affy_hg_u133_plus_2",
>     >           values=c("201746_at","231640_at"), mart=mart)
>     >
>     >     it will give:
>     >
>     >       ensembl_gene_id ensembl_transcript_id synonymous_snp_count non_synonymous_snp_count
>     >     1 ENSG00000141510       ENST00000269305                    5                       20
>     >     2 ENSG00000133703       ENST00000256078                    1                        1
>     >     3 ENSG00000133703       ENST00000311936                    1                        1
>     >
>     >
>     >     Unfortunately you won't be able to get the affy id in the output,
>     >     but you can use biomaRt to map the Ensembl ids in the output back
>     >     to the affy ids.
>     >
>     >     Cheers,
>     >     Steffen
>     >
>     >
>     >     marco zucchelli wrote:
>     >     > Hi,
>     >     >
>     >     >  I was wondering whether there is an annotation package for Affy
>     >     > 133plus2 reporting the number of synonymous & non-synonymous
>     >     > changes for the genes on the array.
>     >     >
>     >     > If there is not, does anybody have a good suggestion about how
>     >     > to retrieve this information from databases?
>     >     >
>     >     >
>     >     > Marco
>     >     >

-- 
Steffen Durinck, Ph.D.

Oncogenomics Section
Pediatric Oncology Branch
National Cancer Institute, National Institutes of Health
URL: http://home.ccr.cancer.gov/oncology/oncogenomics/

Phone: 301-402-8103
Address:
Advanced Technology Center,
8717 Grovemont Circle
Gaithersburg, MD 20877