Session: Technology & Bioinformatics
Only a small fraction of the proteins encoded in any of the sequenced microbial genomes have ever been studied experimentally. Functional assignments for the rest simply transfer the annotation of the best database hit, which requires setting arbitrary (high) similarity cut-off scores. The recent (Dec. 1, 2000) release of the Clusters of Orthologous Groups (COG) database (http://www.ncbi.nlm.nih.gov/COG) contains groups of orthologous proteins (domains) from 30 publicly available complete genomes, identified in all-against-all sequence comparisons. Using COGs as a reference set for similarity searches eliminates the most common sources of error in the functional assignment of new proteins (unreliable database entries, divergence of sequences and functions in the course of evolution, multi-domain organization of proteins, and low sequence complexity) and substantially improves the overall quality of function prediction. Annotating protein families (COGs), as opposed to individual proteins, allows one to fine-tune the family assignments based on the diversity of proteins in each particular COG. Information about better-studied organisms (E. coli, B. subtilis, yeast) can thus be used for genome analysis of poorly studied organisms. Analysis of phylogenetic patterns, domain fusions, and genome neighborhoods provides additional tools for recognizing untranslated, mis-annotated, and unannotated proteins. Superimposing the COGs on a map of biochemical pathways can be used to identify the metabolic pathways that are present or absent in any given organism.
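
As a rough illustration of the last two ideas, the following Python sketch derives a phylogenetic pattern (presence or absence of a COG across genomes) and lists the steps of a pathway that have no COG member in a given genome. It is a minimal sketch under stated assumptions, not the procedure used by the COG database itself: the COG identifiers, genome codes, pathway definition, and function names are all hypothetical placeholders chosen for the example.

```python
from typing import Dict, List, Set

# Hypothetical toy data: COG identifier -> genomes that have at least one member.
# The identifiers and genome codes are placeholders, not real COG entries.
COG_MEMBERSHIP: Dict[str, Set[str]] = {
    "COG_A": {"eco", "bsu", "sce"},
    "COG_B": {"eco", "bsu"},
    "COG_C": {"eco"},
}

# Fixed genome order, so that patterns from different COGs are comparable.
GENOMES: List[str] = ["eco", "bsu", "sce"]

# Hypothetical pathway: an ordered list of steps, each represented by one COG.
PATHWAY_STEPS: List[str] = ["COG_A", "COG_B", "COG_C"]


def phylogenetic_pattern(cog: str,
                         membership: Dict[str, Set[str]],
                         genomes: List[str]) -> str:
    """Return a '+'/'-' string marking which genomes are represented in a COG."""
    present = membership.get(cog, set())
    return "".join("+" if genome in present else "-" for genome in genomes)


def missing_steps(genome: str,
                  steps: List[str],
                  membership: Dict[str, Set[str]]) -> List[str]:
    """Return the pathway steps (COGs) that have no member in the given genome."""
    return [cog for cog in steps if genome not in membership.get(cog, set())]


if __name__ == "__main__":
    for cog in PATHWAY_STEPS:
        print(cog, phylogenetic_pattern(cog, COG_MEMBERSHIP, GENOMES))
    for genome in GENOMES:
        print(genome, "missing:", missing_steps(genome, PATHWAY_STEPS, COG_MEMBERSHIP))
```

In this toy setting, a genome with an empty list of missing steps would be a candidate for having the complete pathway, while a single missing step in an otherwise complete pathway could point to an unannotated or mis-annotated protein worth a closer look.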