PAG-XVII  Plant & Animal Genomes XVII Conference

January 10-14, 2009
Town & Country Convention Center
San Diego, CA



W065 : Bioinformatics


Acceleration Of Genome Annotation With Machine Learning

Mark Borodovsky

  Wallace H. Coulter Department of Biomedical Engineering, Computational Science and Engineering Division, Center for Bioinformatics and Computational Genomics, Georgia Institute of Technology and Emory University, Atlanta, Georgia 30332 USA

Finding protein-coding genes is the most important goal of a eukaryotic genome sequencing project. However, the task of eukaryotic gene prediction is challenging. Identifying genes by mapping cDNA/EST sequences to genomic DNA requires abundant cDNA/EST data, and inferring gene models from alignments with closely related genomes requires the availability of a reference genome. Conventional statistical ab initio methods require large training sets of validated genes for estimating the species-specific parameters of the algorithm. In practice, neither EST/cDNA data nor validated genes may be available in sufficient amounts until rather late stages of genome sequencing. We have developed an expectation-maximization type of ab initio gene prediction algorithm that carries out eukaryotic gene finding in parallel with estimation of the algorithm parameters. The new method follows the path of iterative Viterbi training. Several dynamically changing restrictions on the possible range of model parameters are added to filter out fluctuations in the initial steps of the algorithm that could redirect the iteration process away from the biologically relevant point in parameter space. Tests on a large spectrum of well-studied eukaryotic genomes, from fungi to plants, insects and animals, have shown that the new method performs comparably to or better than conventional methods in which supervised model training precedes the gene prediction step.
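
The general idea of iterative Viterbi training can be illustrated with a minimal sketch. This is not the algorithm described in the abstract: it assumes a hypothetical two-state hidden Markov model (noncoding vs. coding) with single-nucleotide emissions, whereas a real gene finder models exons, introns, splice sites, and codon structure. The function names, the pseudocount, and the starting parameters are all assumptions made for illustration, and the dynamically changing parameter restrictions mentioned above are omitted. Each iteration labels the sequence with the current parameters (the prediction step) and then re-estimates the parameters from that labeling, so no pre-annotated training set is needed.

```python
import math

STATES = ("noncoding", "coding")  # hypothetical two-state toy model
ALPHABET = "ACGT"


def viterbi(seq, start, trans, emit):
    """Return the most probable state path for seq, computed in log space."""
    n, k = len(seq), len(STATES)
    score = [[0.0] * k for _ in range(n)]
    back = [[0] * k for _ in range(n)]
    for s in range(k):
        score[0][s] = math.log(start[s]) + math.log(emit[s][seq[0]])
    for t in range(1, n):
        for s in range(k):
            # Best predecessor state for being in state s at position t.
            prev = max(range(k),
                       key=lambda p: score[t - 1][p] + math.log(trans[p][s]))
            back[t][s] = prev
            score[t][s] = (score[t - 1][prev] + math.log(trans[prev][s])
                           + math.log(emit[s][seq[t]]))
    # Trace back from the best final state.
    path = [max(range(k), key=lambda s: score[n - 1][s])]
    for t in range(n - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


def reestimate(seq, path, pseudo=1.0):
    """Re-estimate transition and emission probabilities from a hard labeling.

    Pseudocounts (an assumption of this sketch) keep every probability nonzero.
    """
    k = len(STATES)
    trans_counts = [[pseudo] * k for _ in range(k)]
    emit_counts = [{c: pseudo for c in ALPHABET} for _ in range(k)]
    for t, s in enumerate(path):
        emit_counts[s][seq[t]] += 1
        if t > 0:
            trans_counts[path[t - 1]][s] += 1
    trans = [[c / sum(row) for c in row] for row in trans_counts]
    emit = [{c: v / sum(d.values()) for c, v in d.items()} for d in emit_counts]
    return trans, emit


def viterbi_training(seq, start, trans, emit, max_iter=50):
    """Alternate decoding and re-estimation until the labeling stops changing."""
    path = None
    for _ in range(max_iter):
        new_path = viterbi(seq, start, trans, emit)
        if new_path == path:
            break
        path = new_path
        trans, emit = reestimate(seq, path)
    return path, trans, emit


if __name__ == "__main__":
    # Illustrative starting parameters, not estimates from any real genome.
    start = [0.5, 0.5]
    trans = [[0.9, 0.1], [0.1, 0.9]]
    emit = [
        {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
        {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    ]
    path, trans, emit = viterbi_training("AAAAGGGG", start, trans, emit)
    print([STATES[s] for s in path])
```

Because the re-estimation uses only the single best labeling rather than posterior state probabilities, this is a hard-assignment variant of expectation-maximization. Early labelings made with poor starting parameters can pull the iteration toward a poor solution, which is the kind of fluctuation the parameter restrictions described in the abstract are meant to filter out.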