January 12-16, 2002
Town & Country Convention Center
San Diego, CA
Bioinformatics: Algorithms
A system for benchmarking fold and function assignments, developed as part of an ongoing effort to build a pipeline for genome annotation, is described. The benchmark includes the following components: a test set (1000 protein domains from SCOP), reliability categories (A - “Certain”, B - “Reliable”, C - “Probable”, D - “Possible”, E - “Potential”, F - “No annotation”) and a protocol to screen methods and incorporate them into the pipeline. In fold assignment, the reliability categories were projected onto probabilities (p-values) of correct prediction as follows: A - 99.9%, B - 99.0%, C - 90.0%, D - 50.0%, E - 10.0%. The test set of 1000 protein domains from SCOP was built using the Astral resource at a 50% sequence identity threshold. To reduce the dependence of the benchmark on the content of the test set, only the portion of the sample with a prediction score (e-value or z-score) close to the optimal cutoff was considered. The decision to incorporate a given method into the pipeline is based on its sensitivity and on the uniqueness of its predictions relative to other candidate methods. Results of testing several sequence similarity search and fold recognition methods with this protocol are presented, in particular NCBI BLAST, WU BLAST, PSI-BLAST, SSEARCH, THREADER2.5, 123D, and ORFeus. PSI-BLAST was tested in various modes and appeared to be superior to the other methods in every reliability category, though most other methods contribute unique predictions not detected by PSI-BLAST. Notably, the optimal performance of PSI-BLAST was achieved with different settings of the search options (number of iterations and e-value at the profile-building step) for different reliability categories. Reliability categories for alignment quality were based on p-values for achieving a certain RMSD and coverage of the target by the template. RMSD and coverage were obtained using a protocol similar to MaxSub.
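The category scheme and the uniqueness criterion described above can be illustrated with a minimal Python sketch. It rests on assumptions not stated in the abstract: each percentage is treated as a lower bound on the estimated probability of a correct prediction, and "uniqueness" is taken to mean correct predictions made by no other candidate method. The function names and the toy domain identifiers are hypothetical.

```python
from typing import Dict, Set

# Probability of correct prediction attached to each category in the abstract.
# Assumption: each value acts as a lower bound for membership in the category.
CATEGORY_THRESHOLDS = [
    ("A", 0.999),  # Certain
    ("B", 0.990),  # Reliable
    ("C", 0.900),  # Probable
    ("D", 0.500),  # Possible
    ("E", 0.100),  # Potential
]


def reliability_category(p_correct: float) -> str:
    """Return the best category whose threshold p_correct reaches, else 'F'."""
    if not 0.0 <= p_correct <= 1.0:
        raise ValueError("p_correct must lie in [0, 1]")
    for label, threshold in CATEGORY_THRESHOLDS:
        if p_correct >= threshold:
            return label
    return "F"  # No annotation


def unique_predictions(correct_by_method: Dict[str, Set[str]], method: str) -> Set[str]:
    """Correct predictions of `method` that no other candidate method makes."""
    others: Set[str] = set()
    for name, hits in correct_by_method.items():
        if name != method:
            others |= hits
    return correct_by_method[method] - others


if __name__ == "__main__":
    # Toy data with made-up domain identifiers, for illustration only.
    toy = {
        "PSI-BLAST": {"dom1", "dom2", "dom3"},
        "123D": {"dom2", "dom4"},
    }
    print(reliability_category(0.95))
    print(sorted(unique_predictions(toy, "123D")))
```

Under these assumptions, a method with lower overall sensitivity can still earn a place in the pipeline when `unique_predictions` returns a non-empty set for it.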
Details of the data representation of fold and function assignments in the context of reliability categories, for storage in a relational model (using the Oracle™ DBMS), are provided.
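The abstract does not give the actual schema, so the following is only a hypothetical sketch of how assignments and reliability categories might be stored relationally. It uses Python's built-in `sqlite3` module as a stand-in for the Oracle DBMS; every table name, column name and sample row is an assumption made for illustration.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical two-table layout."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE reliability_category (
            code      TEXT PRIMARY KEY,   -- 'A' .. 'F'
            label     TEXT NOT NULL,
            p_correct REAL                -- NULL for 'No annotation'
        );
        CREATE TABLE fold_assignment (
            target_id TEXT NOT NULL,
            method    TEXT NOT NULL,
            fold      TEXT NOT NULL,
            category  TEXT NOT NULL REFERENCES reliability_category(code)
        );
        """
    )
    # Category labels and probabilities as listed in the abstract.
    conn.executemany(
        "INSERT INTO reliability_category VALUES (?, ?, ?)",
        [
            ("A", "Certain", 0.999),
            ("B", "Reliable", 0.990),
            ("C", "Probable", 0.900),
            ("D", "Possible", 0.500),
            ("E", "Potential", 0.100),
            ("F", "No annotation", None),
        ],
    )
    # Made-up sample assignments, for illustration only.
    conn.executemany(
        "INSERT INTO fold_assignment VALUES (?, ?, ?, ?)",
        [
            ("target_001", "PSI-BLAST", "fold_x", "A"),
            ("target_002", "123D", "fold_y", "D"),
            ("target_003", "SSEARCH", "fold_x", "B"),
        ],
    )
    conn.commit()
    return conn


def assignments_at_least(conn: sqlite3.Connection, category: str) -> list:
    """Return assignments whose category is `category` or better (A is best)."""
    cursor = conn.execute(
        "SELECT target_id, method, fold, category FROM fold_assignment "
        "WHERE category <= ? ORDER BY category, target_id",
        (category,),
    )
    return cursor.fetchall()


if __name__ == "__main__":
    for row in assignments_at_least(build_demo_db(), "B"):
        print(row)
```

Keeping the categories in their own table lets a query filter annotations by reliability without hard-coding the probability values; the letter codes sort in order of decreasing reliability, which is what the `category <= ?` comparison relies on.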