[BioC] GO terms for E. coli micro arrays (ecoliK12.db generation)

Thu Jun 11 19:19:52 CEST 2009

Hi,

Has anybody succeeded in constructing an ecoliK12.db database with usable 
Gene Ontology annotations? topGO is a nice R package that works very well 
with the yeast genome, and I would like to use it with E. coli, but almost 
no GO terms appear to be available for E. coli when using the tools 
provided by AnnotationDbi.

No ecoliK12.db database exists in the repositories, but according to the 
documentation in the AnnotationDbi package, this should be very easy with 
the makeECOLICHIP_DB command from that package.

However, the code did not run as is. First, some modifications had to be 
made to the AnnotationDbi package. I downloaded the source code 
(AnnotationDbi_1.6.0.tar.gz). I also made sure that ecoliK12.db0 was 
installed (version ecoliK12.db0_2.2.11.tar.gz was used).

In the directory AnnotationDbi/R, the following two files were modified.

sqlForge_baseMapBuilder.R:

Comment out line 234:

sql <- "INSERT INTO probe2gene SELECT DISTINCT m.probe_id, u.gene_id \
FROM min_other_rank as m INNER JOIN src.unigene as u WHERE \
m.gene_id=u.unigene_id;"

and line 235:

sqliteQuickSQL(db, sql)

Otherwise one gets the error:

RS-DBI driver: (error in statement: no such table: src.unigene)

sqlForge_tableBuilder.R:

Comment out line 181:

sqliteQuickSQL(db, "ANALYZE;")

and lines 3179 and 3180:

sqliteQuickSQL(db, "VACUUM probe_map;")
sqliteQuickSQL(db, "ANALYZE;")

Otherwise one gets the error:

RS-DBI driver: (RS_SQLite_exec: could not execute1: attempt to write a 
readonly database)

The package was retarred and installed with:

R CMD INSTALL AnnotationDbi_1.6.0.tar.gz

I also downloaded the annotation file for the E. coli 2 array from 
Affymetrix (E_coli_2.na28.annot.csv).

R was started and the following commands were given (make sure the 
directory 'ecoliK12.db' exists; the path to the site-library may also 
vary):

library(AnnotationDbi);library(ecoliK12.db0)

makeECOLICHIP_DB(affy=TRUE,prefix='ecoliK12',fileName="E_coli_2.na28.annot.csv",
baseMapType='eg',chipSrc='/usr/local/lib/R/site-library/ecoliK12.db0/extdata/chipsrc_ecoliK12.sqlite',
chipMapSrc='/usr/local/lib/R/site-library/ecoliK12.db0/extdata/chipmapsrc_ecoliK12.sqlite',
chipName='E_coli_2',outputDir='ecoliK12.db',version='2.2.11')
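
Since the site-library path differs between systems, the two sqlite files 
could probably be located with system.file() instead of a hard-coded path. 
A minimal sketch, assuming the files sit under extdata as in the call 
above; the output location under tempdir() is only an illustration:

```r
# Create the output directory if it is missing (hypothetical location).
out_dir <- file.path(tempdir(), "ecoliK12.db")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

# system.file() returns "" when the package or the file cannot be found.
chip_src <- system.file("extdata", "chipsrc_ecoliK12.sqlite",
                        package = "ecoliK12.db0")
chip_map_src <- system.file("extdata", "chipmapsrc_ecoliK12.sqlite",
                            package = "ecoliK12.db0")

if (chip_src == "" || chip_map_src == "") {
  message("ecoliK12.db0 sqlite files not found; is the package installed?")
}
```

The resulting chip_src, chip_map_src and out_dir values could then be 
passed as chipSrc, chipMapSrc and outputDir.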

In the ecoliK12.db directory, another ecoliK12.db directory was created by 
those R commands. This directory was tarred (tar -czf 
ecoliK12.db_2.2.11.tar.gz ecoliK12.db/) resulting in an installable 
package that technically works with topGO.

But not many GO terms are associated with the probes: far fewer than the 
number of GO terms that can be found for each probe in the probe 
annotation file provided by Affymetrix.

The table below lists the number of GO terms found in the different tables 
for the three ontologies:

MG1655: the number of GO terms annotated to the probes of MG1655, as found 
in the Affymetrix probe annotation file (that array also contains probes 
for other E. coli strains; they are filtered out for this table).

ecoliK12.db: the number of GO terms that are found in the database 
generated by the makeECOLICHIP_DB command from above.

ecoliK12.db0: the number of GO terms that are found in the original 
database that makeECOLICHIP_DB uses for generating the ecoliK12.db 
database. Note that this database contains no evidence codes either (the 
evidence column has the value '-' everywhere).

              GO_BP_all  GO_CC_all  GO_MF_all
MG1655             9899       7023      17925
ecoliK12.db        6394       2999        211
ecoliK12.db0      33526      17367       1266

(The comparison was done with the _all tables from the database, to be 
able to compare with the Affymetrix file.)
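
As an illustration of how GO IDs could be pulled out of the annotation 
file for such a count, here is a minimal sketch. It assumes the usual 
Affymetrix layout of the GO columns (entries separated by ' /// ', fields 
by ' // ', and '---' for empty cells); the example cell and the function 
name parse_go_cell are made up.

```r
# Made-up cell in the style of a "Gene Ontology Biological Process" column.
cell <- paste(
  "6096 // glycolysis // inferred from electronic annotation",
  "6355 // regulation of transcription // inferred from electronic annotation",
  sep = " /// "
)

parse_go_cell <- function(cell) {
  # Empty cells are assumed to be NA or "---".
  if (is.na(cell) || cell == "---") return(character(0))
  entries <- strsplit(cell, " /// ", fixed = TRUE)[[1]]
  # The first field of each entry is taken to be the numeric GO ID.
  ids <- vapply(strsplit(entries, " // ", fixed = TRUE), `[`, character(1), 1)
  sprintf("GO:%07d", as.integer(ids))
}

go_ids <- parse_go_cell(cell)
print(go_ids)
print(length(parse_go_cell("---")))
```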

Why are not more GO terms found in ecoliK12.db? Using a baseMapType other 
than 'eg' does not help. Only 'refseq' doesn't crash, but it yields even 
fewer GO terms than 'eg'. Furthermore, for refseq, I think the 
cleanRefSeqs function in sqlForge_baseMapBuilder.R needs a modification. 
The line

baseMap[,2] = sub("\\.\\d+?$", "", baseMap[,2], perl=TRUE)

should be changed to

baseMap[,2] = sub("^[^_]*_([^_]*)_.*", "\\1", baseMap[,2], perl=TRUE)
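
To illustrate what the two patterns do, here is a small sketch on 
illustrative identifiers (the input strings are hypothetical and not taken 
from the annotation file):

```r
# Hypothetical identifiers; the real column in the annotation file
# may look different.
ids <- c("NP_414542.1", "abc_b0001_xyz")

# Original pattern: strips a trailing version suffix such as ".1".
strip_version <- sub("\\.\\d+?$", "", ids, perl = TRUE)

# Proposed pattern: keeps only the field between the first two underscores;
# strings with fewer than two underscores are left unchanged.
middle_field <- sub("^[^_]*_([^_]*)_.*", "\\1", ids, perl = TRUE)

print(strip_version)  # "NP_414542" "abc_b0001_xyz"
print(middle_field)   # "NP_414542.1" "b0001"
```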

Adding the GO terms from the Affymetrix file to the database afterwards 
does not help either: topGO still reports only a few (fewer than 10) 
significant nodes when comparing aerobically with anaerobically grown 
cells, a comparison that gives more than 2000 differentially expressed 
genes.

A possible problem is that Affymetrix also annotates the probes with the 
redundant (ancestor) GO terms, and that I added all of those to both the 
GO_XX and GO_XX_all tables. The GO_XX tables should normally contain only 
the most specific GO terms.
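
One way to reduce a probe's term list to its most specific terms is to 
drop every term that is an ancestor of another term annotated to the same 
probe. A minimal sketch with a toy ancestor map; the IDs and the function 
name most_specific are hypothetical, and in practice the ancestor lists 
would have to come from the ontology itself (for example, something like 
the GOBPANCESTOR map in GO.db might serve):

```r
# Toy ancestor map: each term lists all of its ancestors (hypothetical IDs).
ancestors <- list(
  "GO:A" = character(0),
  "GO:B" = "GO:A",
  "GO:C" = c("GO:B", "GO:A"),
  "GO:D" = "GO:A"
)

most_specific <- function(terms, ancestors) {
  # Every term that is an ancestor of some annotated term is redundant.
  redundant <- unique(unlist(ancestors[terms], use.names = FALSE))
  setdiff(terms, redundant)
}

probe_terms <- c("GO:A", "GO:B", "GO:C", "GO:D")
direct <- most_specific(probe_terms, ancestors)
print(direct)  # "GO:C" "GO:D"
```

The full list (probe_terms) would then go into GO_XX_all and only the 
reduced list (direct) into GO_XX.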

Is this a known problem? Should I give up doing GO analysis with topGO for 
E. coli? Or is there a workaround?

The R version used is 2.7.1 (2008-06-23) on Debian; for AnnotationDbi and 
ecoliK12.db0, however, the most recent versions were downloaded from the 
BioC website, together with their dependencies.

Thank you very much for any suggestions,

Gaspard