[BioC] Annotating HGU133plus2 genes with number of coding changes
Steffen Durinck
durincks at mail.nih.gov
Thu Apr 19 15:46:51 CEST 2007
Hi Marco,
Ensembl maps everything to the transcript level and when there are
multiple transcripts for one gene, a query will return multiple hits for
that gene.
To see this better you could add the "ensembl_transcript_id" to your query:
probe.list <- getBM(attributes=c("ensembl_gene_id", "ensembl_transcript_id",
                                 "affy_hg_u133_plus_2"),
                    filters="affy_hg_u133_plus_2", values=probes, mart=mart)
You'll see that each of the apparently duplicated rows refers to a
different transcript, so at this level there is no redundancy.
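A minimal sketch of that check, using a small hand-made data frame in place of a real getBM() result so that it runs offline. The transcript IDs are placeholders, not real Ensembl identifiers, and the column names simply mirror the attributes in the query above.

```r
# Toy stand-in for the transcript-level getBM() output.
# The ENST_placeholder_* values are made up for illustration only.
probe.list <- data.frame(
  ensembl_gene_id       = c("ENSG00000184895", "ENSG00000184895", "ENSG00000129824"),
  ensembl_transcript_id = c("ENST_placeholder_1", "ENST_placeholder_2", "ENST_placeholder_3"),
  affy_hg_u133_plus_2   = c("207893_at", "207893_at", "201909_at"),
  stringsAsFactors = FALSE
)

# Gene/probe pairs repeat when a gene has more than one transcript ...
gene.level <- probe.list[, c("ensembl_gene_id", "affy_hg_u133_plus_2")]
n.dup.gene <- sum(duplicated(gene.level))

# ... but whole rows, including the transcript id, do not.
n.dup.transcript <- sum(duplicated(probe.list))

# Number of transcripts behind each probe
transcripts.per.probe <- table(probe.list$affy_hg_u133_plus_2)
```

On a real result the same three lines should show where the extra rows come from: any probe with a count above one in transcripts.per.probe sits on a gene with several transcripts.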
The mapping to the transcript level is a choice of the Ensembl team and
we cannot change this. It makes sense for other annotation information
such as protein domains: some alternatively spliced transcripts might have
a certain domain while other transcripts of the same gene do not. Similarly,
if you query for 3'UTRs, mapping to the transcript level lets you retrieve
all the different UTRs associated with a gene. Different transcripts of the
same gene might even have different functions, and the current strategy
would allow transcript-specific GO annotations.
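If only gene-level pairs are needed, one possible way to drop the repeats is base R's unique(). This is a sketch, not something biomaRt does for you; the data frame below just retypes the six rows quoted further down in this thread so the example runs without a mart connection.

```r
# First six rows of the gene-level result quoted in this thread
probe.list <- data.frame(
  ensembl_gene_id     = c("ENSG00000184895", "ENSG00000184895", "ENSG00000129824",
                          "ENSG00000129824", "ENSG00000067646", "ENSG00000067646"),
  affy_hg_u133_plus_2 = c("207893_at", "207893_at", "201909_at",
                          "201909_at", "207247_s_at", "207246_at"),
  stringsAsFactors = FALSE
)

# Keep one row per distinct gene/probe pair; rows that differed only
# by the (unrequested) transcript collapse into one.
gene.probe <- unique(probe.list)
```

Here gene.probe keeps four rows. Note that a probe can still appear more than once afterwards if it maps to several genes, so the row count need not equal the number of probes.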
Best regards,
Steffen
marco zucchelli wrote:
> Hi Steffen,
>
> One more question: in the example I reported before, some probes seem
> to be reported twice, e.g. 207893_at is listed 2 times matched to the
> same gene ID. In total, the "probes" vector contains the probes from
> hgu133plus2 (54675), but the query returns 66565 rows.
>
> I do not really understand what this means.
>
> Regards
>
> Marco
>
> probe.list <- getBM(attributes=c("ensembl_gene_id", "affy_hg_u133_plus_2"),
>                     filters="affy_hg_u133_plus_2", values=probes, mart=mart)
>
> head(probe.list)
>
>   ensembl_gene_id affy_hg_u133_plus_2
> 1 ENSG00000184895           207893_at
> 2 ENSG00000184895           207893_at
> 3 ENSG00000129824           201909_at
> 4 ENSG00000129824           201909_at
> 5 ENSG00000067646         207247_s_at
> 6 ENSG00000067646           207246_at
>
>
>
> On 4/3/07, Steffen Durinck <durincks at mail.nih.gov> wrote:
>
> Hi Marco,
>
> It matches the transcripts and then maps those transcripts to the
> genes, even if you don't include the transcript id in the query.
> To see this you could set attributes =
> c("ensembl_gene_id","ensembl_transcript_id","affy_hg_u133_plus_2") in
> your query. Also, if Ensembl didn't find a match for the affy probe
> then it won't be included in the output, and if they find multiple
> matches then all of them will be returned.
>
> For the second part of your question: No, the ordering is random, so
> you'll have to reorder the output with e.g. the match function or loop
> over it.
>
> Cheers,
> Steffen
>
> marco zucchelli wrote:
> > Steffen,
> >
> > Anyway, does this procedure match the affy_ID to the specific
> > transcript(s) that the probeset is targeting, or does it match it
> > to a gene and then get all the available transcripts for the gene?
> >
> > Moreover, it seems that the returned values from getBM are not
> > ordered as the input values.
> > In fact, if I use:
> >
> > head(probes)
> > [1] "AFFX-BioB-5_at" "AFFX-BioB-M_at" "AFFX-BioB-3_at"
> > "AFFX-BioC-5_at" "AFFX-BioC-3_at" "AFFX-BioDn-5_at"
> >
> > probe.list <- getBM(attributes=c("ensembl_gene_id", "affy_hg_u133_plus_2"),
> >                     filters="affy_hg_u133_plus_2", values=probes, mart=mart)
> >
> > head(probe.list)
> >
> >   ensembl_gene_id affy_hg_u133_plus_2
> > 1 ENSG00000184895           207893_at
> > 2 ENSG00000184895           207893_at
> > 3 ENSG00000129824           201909_at
> > 4 ENSG00000129824           201909_at
> > 5 ENSG00000067646         207247_s_at
> > 6 ENSG00000067646           207246_at
> >
> > Is there any rule by which getBM orders the probes?
> > Or am I doing something wrong?
> >
> >
> > Marco
> >
> >
> >
> > On 3/30/07, Steffen Durinck <durincks at mail.nih.gov> wrote:
> >
> > Hi Marco,
> >
> > You can do this with the biomaRt package (use the devel version,
> > >= 1.9.21). Here's how:
> >
> > library(biomaRt)
> > mart=useMart("ensembl", dataset="hsapiens_gene_ensembl")
> > getBM(attributes=c("ensembl_gene_id", "ensembl_transcript_id",
> >                    "synonymous_snp_count", "non_synonymous_snp_count"),
> >       filters="affy_hg_u133_plus_2",
> >       values=c("201746_at", "231640_at"), mart=mart)
> >
> > It will give:
> >
> >   ensembl_gene_id ensembl_transcript_id synonymous_snp_count non_synonymous_snp_count
> > 1 ENSG00000141510       ENST00000269305                    5                       20
> > 2 ENSG00000133703       ENST00000256078                    1                        1
> > 3 ENSG00000133703       ENST00000311936                    1                        1
> >
> >
> > Unfortunately you won't be able to get the affy id in the output,
> > but you can use biomaRt to map the Ensembl ids in the output back
> > to the affy ids.
> >
> > Cheers,
> > Steffen
> >
> >
> > marco zucchelli wrote:
> > > Hi,
> > >
> > > I was wondering whether there is an annotation package for Affy
> > > 133plus2 reporting the number of synonymous & non-synonymous
> > > changes for the genes on the array.
> > >
> > > If it does not exist, does anybody have a good suggestion about
> > > how to retrieve this information from databases?
> > >
> > >
> > > Marco
> > >
> > > _______________________________________________
> > > Bioconductor mailing list
> > > Bioconductor at stat.math.ethz.ch
> > > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > > Search the archives:
> > > http://news.gmane.org/gmane.science.biology.informatics.conductor
> > >
> >
> >
--
Steffen Durinck, Ph.D.
Oncogenomics Section
Pediatric Oncology Branch
National Cancer Institute, National Institutes of Health
URL: http://home.ccr.cancer.gov/oncology/oncogenomics/
Phone: 301-402-8103
Address:
Advanced Technology Center,
8717 Grovemont Circle
Gaithersburg, MD 20877