[BioC] Changes in annotations?

Thu Jul 9 15:36:46 CEST 2009

Hi Alex,

We have taken this up as an internal topic of discussion and hopefully 
will be able to come up with a solution that is a bit better than what 
we are doing right now.

However, some of this is going to be unavoidable. The chips were based 
on UniGene build 133 (and we are now on build 219), so things are going 
to tend to move around. But I don't think this is the main reason for 
what you are seeing, as the latest Affy annotation file is still based 
on UCSC build hg18 (which is based on NCBI build 36.1), so these data 
are over three years old.

Instead, it appears that Affy have changed something about how they are 
annotating things. Some of this is an improvement, and some not so much.

For instance, 203315_at does interrogate NCK2:

http://genome.ucsc.edu/cgi-bin/hgTracks?insideX=115&revCmplDisp=0&hgsid=136510693&hgt.out3=10x&position=chr2%3A105876577-105876826&hgtgroup_map_close=0&hgtgroup_phenDis_close=1&hgtgroup_genes_close=0&hgtgroup_rna_close=0&hgtgroup_expression_close=1&hgtgroup_regulation_close=0&hgtgroup_compGeno_close=0&hgtgroup_varRep_close=0&hgtgroup_encodeGenes_close=1&hgtgroup_encodeTxLevels_close=1&hgtgroup_encodeChip_close=1&hgtgroup_encodeChrom_close=1&hgtgroup_encodeCompAndVar_close=1

whereas 232583_at does not:

http://genome.ucsc.edu/cgi-bin/hgTracks?insideX=115&revCmplDisp=0&hgsid=136510693&hgt.out3=10x&position=chr2%3A105752637-105752886&hgtgroup_map_close=0&hgtgroup_phenDis_close=1&hgtgroup_genes_close=0&hgtgroup_rna_close=0&hgtgroup_expression_close=1&hgtgroup_regulation_close=0&hgtgroup_compGeno_close=0&hgtgroup_varRep_close=0&hgtgroup_encodeGenes_close=1&hgtgroup_encodeTxLevels_close=1&hgtgroup_encodeChip_close=1&hgtgroup_encodeChrom_close=1&hgtgroup_encodeCompAndVar_close=1

so in this case they have improved their annotations.

In the case of 238900_at, they seem to have added in a bunch of EG IDs 
that are based on a model rather than being actual reviewed genes. In 
addition, that probeset seems to hit an intron, so adding in all this 
other annotation appears pointless:

http://genome.ucsc.edu/cgi-bin/hgTracks?insideX=115&revCmplDisp=0&hgsid=136511081&hgt.out3=10x&position=chr6%3A32653723-32653972&hgtgroup_map_close=0&hgtgroup_phenDis_close=1&hgtgroup_genes_close=0&hgtgroup_rna_close=0&hgtgroup_expression_close=1&hgtgroup_regulation_close=0&hgtgroup_compGeno_close=0&hgtgroup_varRep_close=0&hgtgroup_encodeGenes_close=1&hgtgroup_encodeTxLevels_close=1&hgtgroup_encodeChip_close=1&hgtgroup_encodeChrom_close=1&hgtgroup_encodeCompAndVar_close=1

What this really comes down to is the fact that you can't take anything 
for granted. Not only do you have to validate a set of significant 
results using some other technology, you also need to ensure that the 
chip you are using is actually measuring what you think prior to doing 
the validation.

It should be relatively painless to set up a pipeline where you could 
use the BSgenome.Hsapiens.UCSC.hg18 package along with functions from 
BSgenome/Biostrings to map probesets to the genome and then use the 
org.Hs.eg.db package to see if a given probeset is even interrogating 
the gene it is supposed to. One could then use rtracklayer to visualize 
things in the UCSC genome browser to make sure you weren't interrogating 
an intron.

Best,

Jim

Alex Sanchez wrote:
> Hi James and thanks for the help.
> 
> I know, of course, that what Bioconductor does is to take the 
> annotations from public data sources.
> I will now turn to affymetrix to see if someone there, or in their 
> forums, can explain why so many annotations have been turned into "NA's"
> 
> There seem to be two problems here
> - The first one, pointed by Loren in his message today is synonims. For 
> instance the first probeset in my list was 238900_at,
>   In the first version I used (BioC 1.9) the Gene Symbol provide by 
> getSymbol was HLA-DRB1 whereas now (BioC 2.4) it is LOC100133484
>   However the original symbol is still in the Affymetrix annotation file 
> "HG-U133_Plus_2.na28.annot.csv". It seems as if the new symbol is intended
>   to show the relation with the new Entrez ID (100133484).
> In any case it is ennoying but it can be assumed.
> - What seems worse is what happens to some annotations: If, for 
> instance,  I take the second probeset in my list, 232583_at, and I look 
> for it in the affy annotations file I find that apart some links to 
> databases as genebank it does not have a gene symbol or an entrez (or 
> most annotations anymore).
> In the previous version of the annotations this probeset was one of two 
> associated with gene NCK2 (232583_at;203315_at) .
> The problem is that the second probeset (203315_at) was not selected by 
> the analysis. That is the only evidence suggesting that gene NCK2 was 
> differentially expressed (232583_at) does not suggest it anymore.
> 
> Last, but in any case least, this is not happening to a few probesets. 
> My original list had 417 probesets. 210 have changed/synonimized their 
> gene symbols, but from these 210, 160 have become NA's
> 
> Seems too many to feel comfortable :-(
> 
> Thanks for the help
> 
> Alex
> 
> ............................................................................................................... 
> 
> Dr. Alex  Sánchez.
> Associate Professor. Statistics Department. University of Barcelona.
> Facultat de Biologia UB. Avda Diagonal 645. 08028 Barcelona. Spain
> asanchez_at_ub.edu
> Statistics and Bioinformatics Unit
> Institut de Recerca. Hospital Universitari Vall 'Hebron
> Passeig Vall d'Hebron 112-119. 08034 Barcelona
> asanchez_at_ir.vhebron.net
> ............................................................................................................... 
> 
> 
> 
> 
> 
> ----- Original Message ----- From: "James W. MacDonald" 
> <jmacdon at med.umich.edu>
> To: "Alex Sanchez" <asanchez at ub.edu>
> Cc: <bioconductor at stat.math.ethz.ch>
> Sent: Monday, July 06, 2009 3:58 PM
> Subject: Re: [BioC] Changes in annotations?
> 
> 
>> Hi Alex,
>>
>> This is a question that comes up on the Bioc list fairly regularly, 
>> and the answer is in two parts:
>>
>> First, the annotations supplied in the various metadata packages 
>> supplied by BioC are *not* our annotations, but are simply a 
>> re-packaging of data we collect from various sources. As an example, 
>> we use the mappings of Affymetrix Probe ID to Entrez Gene ID from the 
>> annotation csv files you can download from the Affy website. We then 
>> map the Entrez Gene IDs to other annotation using primarily NCBI data. 
>> So if you go to Affy's netaffx site (free registration required) and 
>> query on say, 238900_at, you get this:
>>
>> https://www.affymetrix.com/analysis/netaffx/fullrecord.affx?pk=HG-U133_PLUS_2%3A238900_AT 
>>
>>
>> And you will note that the first Entrez Gene ID listed there is 
>> 100133484, which happens to be a defunct ID. However, this is the 
>> first of many listed there (and we need a one-to-one mapping), so we 
>> chose that one. A more likely Entrez Gene ID can be found further down 
>> the list, but we simply don't have the resources to figure out if 
>> there is a better choice in that list (for every reporter on every 
>> Affy chip we annotate). Nor do we have the resources to ensure that 
>> any of the mappings that Affy make are reasonable to begin with. We 
>> have to trust that they (with *way* more resources that us) are doing 
>> a reasonable job.
>>
>> The second part of the answer has to do with the 'moving target' 
>> aspect of Biological annotations. These data change all the time, and 
>> there is the recurring question of whether one should do an analysis 
>> and 'freeze' it to that point in time, or should the annotations be 
>> updated on a regular basis, with the realization that things can and 
>> will change?
>>
>> Without looking at each reporter ID you list, I can't say if the 
>> changes are due to Affy changing their annotation csv files, or to 
>> changing knowledge of the genome, but I suspect it is a combination of 
>> the two.
>>
>> Best,
>>
>> Jim
>>
>>
>>
>>
>>
>>
>> Alex Sanchez wrote:
>>> Hello
>>>
>>> I have had to review recently an analysis I did some time ago. This 
>>> was done on affymetrix hgu133plus2 chips with R 2.4 and BioC 1.9 I 
>>> have re-run the analyses using R 2.9 and BioC 2.4 (sessionInfo below).
>>> I have been surprised by the changes in the annotations: Many 
>>> probesets that had had an annotation have become NA's whereas some 
>>> have changed their symbol and their Entrez gene.
>>>
>>> To be specific I summarize my question with the top genes of my list
>>>
>>> The list I obtained 2 years ago is:
>>>
>>> probeset    locuslink    symbol
>>> 238900_at 3123 HLA-DRB1
>>> 232583_at 8440 NCK2
>>> 236307_at 60468 BACH2
>>> 223620_at 2857 GPR34
>>> 219759_at 64167 LRAP
>>> 201702_s_at 5514 PPP1R10
>>> 232882_at 2308 FOXO1A
>>> 213446_s_at 8826 IQGAP1
>>> 234033_at 9693 RAPGEF2
>>> 243006_at 2534 FYN
>>> 244648_at 54520 CCDC93
>>> 243691_at 23142 DCUN1D4
>>> 239264_at 60412 EXOC4
>>> 243546_at 143686 SESN3
>>> 205239_at 374 AREG
>>> 1565703_at 55520 ELAC1
>>> 244061_at 55843 ARHGAP15
>>> 230505_at 26037 SIPA1L1
>>> 242688_at 9320 TRIP12
>>> 1556474_a_at 285097 FLJ38379
>>> 232614_at 596 BCL2
>>> 1565689_at 3839 KPNA3
>>> 236685_at NA NA
>>> 225173_at 93663 ARHGAP18
>>> 241893_at 4249 MGAT5
>>>
>>> I used the following code to reproduce the issue with the annotations:
>>>
>>>
>>> #####################################################################
>>> ## Verification using R 2.9 & BioC 2.4
>>> #####################################################################
>>>
>>>> probes<-c("238900_at" , "232583_at", "236307_at" ,"223620_at" , 
>>>> "219759_at" ,
>>> +  "201702_s_at" , "232882_at"  ,  "213446_s_at",  "234033_at", 
>>> "243006_at" ,  +  "244648_at" ,   "243691_at" ,   "239264_at" , 
>>> "243546_at" ,   "205239_at" ,
>>> +  "1565703_at" ,  "244061_at"  ,  "230505_at" ,   "242688_at" , 
>>> "1556474_a_at",
>>> +  "232614_at"  ,  "1565689_at" ,  "236685_at"  ,  "225173_at" , 
>>> "241893_at")
>>>> library(hgu133plus2.db)
>>>> library(annotate)
>>>>
>>>> entrezs<- getEG(probes, "hgu133plus2")
>>>> symbols<- getSYMBOL(probes, "hgu133plus2")
>>>> sel2<- cbind(probes, entrezs, symbols)
>>>> sel2
>>>              probes         entrezs     symbols       238900_at 
>>> "238900_at"    "100133484" "LOC100133484"
>>> 232583_at    "232583_at"    NA          NA           236307_at 
>>> "236307_at"    NA          NA           223620_at    "223620_at" 
>>> "2857"      "GPR34"       219759_at    "219759_at"    "64167"     
>>> "ERAP2" 201702_s_at  "201702_s_at"  "5514"      "PPP1R10"     
>>> 232882_at "232882_at"    NA          NA           213446_s_at  
>>> "213446_s_at" "8826"      "IQGAP1"      234033_at    "234033_at"    
>>> NA          NA 243006_at    "243006_at"    NA          NA           
>>> 244648_at "244648_at"    NA          NA           243691_at    
>>> "243691_at"    NA NA           239264_at    "239264_at"    
>>> NA          NA 243546_at    "243546_at"    NA          NA           
>>> 205239_at "205239_at"    "374"       "AREG"        1565703_at   
>>> "1565703_at" "4089"      "SMAD4"       244061_at    "244061_at"    
>>> NA          NA 230505_at    "230505_at"    "145474"    "LOC145474"   
>>> 242688_at "242688_at"    NA          NA           1556474_a_at 
>>> "1556474_a_at" "285097"    "FLJ38379"    232614_at    "232614_at"    
>>> NA          NA 1565689_at   "1565689_at"   NA          NA           
>>> 236685_at "236685_at"    NA          NA           225173_at    
>>> "225173_at" "93663"     "ARHGAP18"    241893_at    "241893_at"    
>>> NA          NA
>>>> sessionInfo()
>>> R version 2.9.0 (2009-04-17) i386-pc-mingw32 locale:
>>> LC_COLLATE=English_United States.1252;LC_CTYPE=English_United 
>>> States.1252;LC_MONETARY=English_United 
>>> States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252
>>>
>>> attached base packages:
>>> [1] stats     graphics  grDevices utils     datasets  methods   base 
>>> other attached packages:
>>> [1] annotate_1.22.0       hgu133plus2.db_2.2.11 RSQLite_0.7-1 
>>> DBI_0.2-4             AnnotationDbi_1.6.0   Biobase_2.4.1       
>>> loaded via a namespace (and not attached):
>>> [1] xtable_1.5-5
>>> #############################################
>>>
>>> Many probesets seem to have changed.
>>> Can someone explain to me what is happening (or what may I be doing 
>>> wrong)?
>>>
>>> The same code does not work with R 2.4 but if I change hgu133plus2.db 
>>> by hgu133plus2 and getEG by getLL I obtain the original results:
>>>
>>> ###############################################
>>> ### Review of annotatons with R 2.4 and BioC 1.9
>>> ###############################################
>>>
>>> ### This code is executed on a clean new session with R 2. and BioC 1.9
>>>
>>>> probes<-c("238900_at" , "232583_at", "236307_at" ,"223620_at" , 
>>>> "219759_at" ,
>>> +  "201702_s_at" , "232882_at"  ,  "213446_s_at",  "234033_at", 
>>> "243006_at" ,  +  "244648_at" ,   "243691_at" ,   "239264_at" , 
>>> "243546_at" ,   "205239_at" ,
>>> +  "1565703_at" ,  "244061_at"  ,  "230505_at" ,   "242688_at" , 
>>> "1556474_a_at",
>>> +  "232614_at"  ,  "1565689_at" ,  "236685_at"  ,  "225173_at" , 
>>> "241893_at")
>>>> LLs<- getLL(rownames(sel), "hgu133plus2")
>>>> symbols<- getSYMBOL(rownames(sel), "hgu133plus2")
>>>> sel1<- cbind(probes, LLs, symbols)
>>>> sel1
>>>              probes         LLs      symbols    238900_at    
>>> "238900_at" "3123"   "HLA-DRB1" 232583_at    "232583_at"    "8440"   
>>> "NCK2" 236307_at    "236307_at"    "60468"  "BACH2"    223620_at    
>>> "223620_at" "2857"   "GPR34"    219759_at    "219759_at"    "64167"  
>>> "ERAP2" 201702_s_at  "201702_s_at"  "5514"   "PPP1R10"  232882_at    
>>> "232882_at" "2308"   "FOXO1"    213446_s_at  "213446_s_at"  "8826"   
>>> "IQGAP1" 234033_at    "234033_at"    "9693"   "RAPGEF2"  243006_at    
>>> "243006_at" "2534"   "FYN"      244648_at    "244648_at"    "54520"  
>>> "CCDC93" 243691_at    "243691_at"    "23142"  "DCUN1D4"  239264_at    
>>> "239264_at" "60412"  "EXOC4"    243546_at    "243546_at"    "143686" 
>>> "SESN3" 205239_at    "205239_at"    "374"    "AREG"     1565703_at   
>>> "1565703_at" "4089"   "SMAD4"    244061_at    "244061_at"    "55843"  
>>> "ARHGAP15" 230505_at    "230505_at"    "145474" "LOC145474"
>>> 242688_at    "242688_at"    "9320"   "TRIP12"   1556474_a_at 
>>> "1556474_a_at" "285097" "FLJ38379" 232614_at    "232614_at"    "596" 
>>> "BCL2"     1565689_at   "1565689_at"   "3839"   "KPNA3"    236685_at 
>>> "236685_at"    NA       NA         225173_at    "225173_at"    
>>> "93663" "ARHGAP18" 241893_at    "241893_at"    "4249"   "MGAT5"
>>>> sessionInfo()
>>> R version 2.4.1 (2006-12-18) i386-pc-mingw32 locale:
>>> LC_COLLATE=Spanish_Spain.1252;LC_CTYPE=Spanish_Spain.1252;LC_MONETARY=Spanish_Spain.1252;LC_NUMERIC=C;LC_TIME=Spanish_Spain.1252 
>>>
>>>
>>> attached base packages:
>>> [1] "tools"     "stats"     "graphics"  "grDevices"
>>> [5] "utils"     "datasets"  "methods"   "base"     other attached 
>>> packages:
>>>    annotate     Biobase hgu133plus2 "1.12.1"    "1.12.2"    "1.14.0" 
>>> ########################################################
>>>
>>> In summary. If I use R 2.4/BioC 1.9 I obtain the same results I 
>>> ibtained 2 years ago, but If I do the same steps using R2.9/BioC2.4 
>>> the results change dramatically.
>>> I have repeated the analyses using BioC 2.01 in R 2.7 and BioC 2.2 in 
>>> R 2.8 (results not shown here). BioC 2.0 yield the same as 1.9 and 
>>> BioC 2.2 the same as 2.4,
>>>
>>> Any help to understand what's happening would be appreciated
>>>
>>> Alex Sanchez
>>>
>>> ----------------------------------------------------------------------------------------------------- 
>>>
>>> Dr. Alex  Sánchez. Statistics Department. University of Barcelona.
>>> Facultat de Biologia UB. Avda Diagonal 645. 08028 Barcelona. Spain
>>> asanchez_at_ub.edu
>>> Statistics and Bioinformatics Unit
>>> Institut de Recerca. Hospital Universitari Vall 'Hebron
>>> Passeig Vall d'Hebron 112-119. 08034 Barcelona
>>> asanchez_at_ir.vhebron.net
>>> ---------------------------------------------------------------------------------------------------- 
>>>
>>>
>>>
>>>
>>>
>>> [[alternative HTML version deleted]]
>>>
>>>
>>>
>>> ------------------------------------------------------------------------
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at stat.math.ethz.ch
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives: 
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>> -- 
>> James W. MacDonald, M.S.
>> Biostatistician
>> Douglas Lab
>> University of Michigan
>> Department of Human Genetics
>> 5912 Buhl
>> 1241 E. Catherine St.
>> Ann Arbor MI 48109-5618
>> 734-615-7826
>>
> 
> 

-- 
James W. MacDonald, M.S.
Biostatistician
Douglas Lab
University of Michigan
Department of Human Genetics
5912 Buhl
1241 E. Catherine St.
Ann Arbor MI 48109-5618
734-615-7826