January 10-14, 2009
Town & Country Convention Center
San Diego, CA
Finding protein-coding genes is the most important goal of a eukaryotic genome sequencing project. However, the task of eukaryotic gene prediction is challenging. Gene identification by mapping cDNA/EST sequences to genomic DNA requires abundant cDNA/EST data, and inferring gene models from alignments with closely related genomes requires the availability of a reference genome. Conventional statistical ab initio methods require large training sets of validated genes to estimate the species-specific parameters of the algorithm. In practice, neither EST/cDNA data nor validated genes may be available in sufficient amounts until rather late stages of genome sequencing. We have developed an expectation-maximization type of ab initio gene prediction algorithm that carries out eukaryotic gene finding in parallel with estimation of the algorithm parameters. The new method follows the path of iterative Viterbi training. Several dynamically changing restrictions on the possible range of model parameters are added to filter out fluctuations in the initial steps of the algorithm that could redirect the iteration process away from the biologically relevant point in parameter space. Tests on well-studied eukaryotic genomes spanning a large spectrum, from fungi to plants, insects and animals, have shown that the new method performs comparably to or better than conventional methods in which supervised model training precedes the gene prediction step.
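The idea of iterative Viterbi training can be illustrated with a minimal sketch. This is not the algorithm described in the abstract: it assumes a toy two-state hidden Markov model (hypothetical "non-coding" and "coding" states that differ only in base composition), and a fixed clamp on the self-transition probabilities stands in for the dynamically changing parameter restrictions. All function names, the pseudocount, and the bounds are illustrative assumptions.

```python
import math
import random

ALPHABET = "ACGT"
STATES = (0, 1)  # assumption: 0 = "non-coding", 1 = "coding" (toy labels)


def viterbi(seq, start, trans, emit):
    """Return the most likely state path for seq under the current parameters."""
    scores = [math.log(start[s]) + math.log(emit[s][seq[0]]) for s in STATES]
    backptrs = []
    for symbol in seq[1:]:
        new_scores, ptrs = [], []
        for s in STATES:
            # Best predecessor state for landing in state s at this position.
            prev = max(STATES, key=lambda p: scores[p] + math.log(trans[p][s]))
            new_scores.append(
                scores[prev] + math.log(trans[prev][s]) + math.log(emit[s][symbol])
            )
            ptrs.append(prev)
        scores = new_scores
        backptrs.append(ptrs)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for ptrs in reversed(backptrs):
        state = ptrs[state]
        path.append(state)
    path.reverse()
    return path


def reestimate(seq, path, pseudo=1.0, stay_bounds=(0.5, 0.999)):
    """Re-estimate parameters from counts along the decoded path."""
    emit_counts = [{c: pseudo for c in ALPHABET} for _ in STATES]
    trans_counts = [[pseudo, pseudo] for _ in STATES]
    for symbol, s in zip(seq, path):
        emit_counts[s][symbol] += 1
    for a, b in zip(path, path[1:]):
        trans_counts[a][b] += 1
    emit = [
        {c: n / sum(counts.values()) for c, n in counts.items()}
        for counts in emit_counts
    ]
    trans = []
    for s in STATES:
        stay = trans_counts[s][s] / sum(trans_counts[s])
        # Assumption: a fixed clamp as a stand-in for parameter restrictions.
        stay = min(max(stay, stay_bounds[0]), stay_bounds[1])
        row = [0.0, 0.0]
        row[s] = stay
        row[1 - s] = 1.0 - stay
        trans.append(row)
    return trans, emit


def viterbi_training(seq, start, trans, emit, max_iter=20):
    """Alternate decoding and re-estimation until the path stops changing."""
    path = None
    for _ in range(max_iter):
        new_path = viterbi(seq, start, trans, emit)
        if new_path == path:
            break
        path = new_path
        trans, emit = reestimate(seq, path)
    return path, trans, emit


def make_toy_sequence(seed=0, blocks=6, block_len=200):
    """Generate alternating AT-rich and GC-rich blocks (synthetic data)."""
    rng = random.Random(seed)
    weights = ([0.35, 0.15, 0.15, 0.35], [0.15, 0.35, 0.35, 0.15])
    seq, labels = [], []
    for b in range(blocks):
        state = b % 2
        seq.extend(rng.choices(ALPHABET, weights=weights[state], k=block_len))
        labels.extend([state] * block_len)
    return "".join(seq), labels
```

In this sketch no labeled training set is used: the loop starts from rough initial parameters, labels the sequence with the current model, and re-estimates the model from its own labeling. A real eukaryotic gene finder would use a far richer model (exons, introns, splice sites, length distributions) than the two-state toy shown here.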