PAG-XIII  Plant & Animal Genomes XIII Conference

January 15-19, 2005
Town & Country Convention Center
San Diego, CA



P856 : Software


An Algorithm To Detect Coding Region, Frame, Chimera And Contamination In EST And Genome Sequences

Sucheta Tripathy, Brett M. Tyler

Virginia Bioinformatics Institute, Virginia Polytechnic and State University, Washington Street -1, Blacksburg, VA 24060

Here we describe a program, based primarily on codon usage statistics combined with pattern recognition, that extracts coding regions from EST and genome sequences. The codon usage values, referred to as the signal, are calculated for each frame with an adjustable window size. Signal values below zero indicate non-coding regions, and values above zero indicate coding regions. This generates several potential coding regions for each frame, and the frame with the highest and longest coding signal is marked as the correct coding frame of the sequence. The signal positions are then correlated with the start and stop patterns in the correct frame to find the absolute positions of the coding regions. A unique feature of the program is that it can sensitively detect contamination, chimeras, frameshift errors, and insertion/deletion sites in ESTs. The program is currently used to predict UTRs, contamination, and chimeras for 30,000 P. sojae ESTs and 64,000 soybean ESTs. Sensitivity and specificity are high (99%) with full-length ESTs. When sequencing errors are present, the program predicts the possible start/stop positions from the signal values. We have also adapted the algorithm to detect putative genes in genome sequences that are missed by conventional gene prediction software. The program is written in C and takes several parameters at the command line, including multiple sequences in FASTA format, the window size, and the codon usage lookup table.
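
The windowed signal and frame selection described above can be illustrated with a minimal sketch in C. This is not the authors' code: the function names (codon_index, longest_coding_run, best_frame), the 64-entry score table layout, the one-codon window step, and the rule "longest run of positive windows, ties broken by summed signal" are all assumptions made for illustration. The abstract does not specify how the codon usage values are computed, so the table is simply taken as an input in which positive scores favour coding sequence. Only the three forward frames are scanned, and start/stop pattern matching is omitted.

```c
#include <assert.h>
#include <stddef.h>

/* Map a nucleotide to 0..3 (A, C, G, T/U); -1 for anything else (e.g. N). */
static int base_index(char b)
{
    switch (b) {
    case 'A': case 'a': return 0;
    case 'C': case 'c': return 1;
    case 'G': case 'g': return 2;
    case 'T': case 't': case 'U': case 'u': return 3;
    default: return -1;
    }
}

/* Map three bases to a codon index 0..63, or -1 if any base is ambiguous. */
int codon_index(const char *s)
{
    int a = base_index(s[0]), b = base_index(s[1]), c = base_index(s[2]);
    if (a < 0 || b < 0 || c < 0) return -1;
    return 16 * a + 4 * b + c;
}

/* Score of codon i in the given frame; ambiguous codons contribute 0. */
static double score_at(const char *seq, int frame, size_t i,
                       const double table[64])
{
    int idx = codon_index(seq + frame + 3 * i);
    return idx < 0 ? 0.0 : table[idx];
}

/*
 * Slide a window of `window` codons along one frame, one codon per step.
 * The signal of a window is the mean table score of its codons.
 * Returns the length (in windows) of the longest run of positive-signal
 * windows; *run_start receives the first codon of that run and *run_sum
 * the summed signal over the run. Hypothetical selection rule.
 */
size_t longest_coding_run(const char *seq, size_t len, int frame,
                          size_t window, const double table[64],
                          size_t *run_start, double *run_sum)
{
    assert(window > 0 && frame >= 0 && frame < 3);
    size_t n = len > (size_t)frame ? (len - (size_t)frame) / 3 : 0;
    size_t best_len = 0, cur_len = 0, cur_start = 0;
    double best_sum = 0.0, cur_sum = 0.0, wsum = 0.0;

    *run_start = 0;
    *run_sum = 0.0;
    if (n < window) return 0;

    for (size_t i = 0; i < window; i++)
        wsum += score_at(seq, frame, i, table);

    for (size_t w = 0; ; w++) {
        double signal = wsum / (double)window;
        if (signal > 0.0) {                 /* window looks coding */
            if (cur_len == 0) { cur_start = w; cur_sum = 0.0; }
            cur_len++;
            cur_sum += signal;
            if (cur_len > best_len ||
                (cur_len == best_len && cur_sum > best_sum)) {
                best_len = cur_len;
                best_sum = cur_sum;
                *run_start = cur_start;
            }
        } else {                            /* window looks non-coding */
            cur_len = 0;
        }
        if (w + window >= n) break;         /* no further full window */
        wsum += score_at(seq, frame, w + window, table)
              - score_at(seq, frame, w, table);
    }
    *run_sum = best_sum;
    return best_len;
}

/* Pick the forward frame (0..2) with the longest, then strongest, run.
 * Returns -1 if no frame has any positive-signal window. */
int best_frame(const char *seq, size_t len, size_t window,
               const double table[64])
{
    int best = -1;
    size_t best_len = 0;
    double best_sum = 0.0;
    for (int f = 0; f < 3; f++) {
        size_t start;
        double sum;
        size_t run = longest_coding_run(seq, len, f, window, table,
                                        &start, &sum);
        if (run > best_len || (run > 0 && run == best_len && sum > best_sum)) {
            best = f;
            best_len = run;
            best_sum = sum;
        }
    }
    return best;
}
```

A caller would fill the 64-entry table from a codon usage lookup file, call best_frame on each sequence, and then search for start and stop patterns near the boundaries of the run reported by longest_coding_run. A sharp change of best frame along a sequence, or a positive run that jumps between frames, is the kind of pattern the abstract associates with frameshifts and chimeras, though the detection logic itself is not sketched here.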

