Paracel Inc., 80 South Lake Ave. #650, Pasadena, CA 91101
Increasingly large quantities of partial gene sequences for model organisms such as Arabidopsis, Drosophila and rice pose substantial challenges for software that processes the data to generate high-quality, longer consensus sequences for comparative genomics and drug discovery. We have developed a series of components that allow rapid and accurate clustering and assembly of ESTs, as well as large-scale genome assembly. The Sequence Filtering and Masking package cleans the data and masks low-complexity regions, repetitive elements and clone vector sequences. The EST Clustering and Assembly package provides accurate and rapid clustering and assembly of both EST and full-length cDNA sequences. It identifies alternative splicing variants and produces multi-sequence alignments that can be viewed in a Java Cluster Viewer. We have successfully clustered Drosophila EST and rat EST data. The Paracel Assembly package provides sensitive and accurate detection of sequence overlaps. It uses quality values to produce a more accurate consensus, and it scales to BAC-sized and larger projects with efficient memory utilization. It is tailored for low-pass genome data assembly. Using clone pair constraints, it generates a 'scaffold', a maximal sequence of ordered and oriented contigs. The assembly results are written to an XML file that can be displayed in a Java Viewer together with the corresponding chromatograms. The package also outputs ace (phrap-compatible) files that can be viewed with third-party systems. A simulation of yeast genome assembly shows that our approach is accurate, scalable and reliable.
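To illustrate how quality values can improve a consensus, the following minimal Python sketch picks, for each alignment column, the base with the highest summed quality score instead of taking a simple majority vote. It is an illustration only and is not the Paracel Assembly algorithm, which the abstract does not describe. The function name `quality_weighted_consensus`, the gap symbol `-`, the `N` placeholder and the input layout (pre-aligned reads of equal length with Phred-like integer qualities) are all assumptions made for the example.

```python
from collections import defaultdict

GAP = "-"  # assumed gap symbol for positions a read does not cover


def quality_weighted_consensus(aligned_reads):
    """Return a consensus string for pre-aligned reads.

    aligned_reads: list of (bases, quals) pairs, where bases is a string and
    quals is a list of integer quality values of the same length. All reads
    are assumed to be padded to the same alignment length.
    """
    if not aligned_reads:
        return ""
    length = len(aligned_reads[0][0])
    consensus = []
    for col in range(length):
        scores = defaultdict(int)
        for bases, quals in aligned_reads:
            base = bases[col]
            if base != GAP:
                # Each read votes with the weight of its quality value.
                scores[base] += quals[col]
        if scores:
            # Iterate over sorted bases so ties break alphabetically.
            consensus.append(max(sorted(scores), key=lambda b: scores[b]))
        else:
            # No read covers this column.
            consensus.append("N")
    return "".join(consensus)


if __name__ == "__main__":
    reads = [
        ("ACGT", [30, 30, 30, 30]),  # one confident G in column 3
        ("ACTT", [30, 30, 10, 30]),  # two low-quality T calls
        ("ACTT", [30, 30, 10, 30]),
    ]
    print(quality_weighted_consensus(reads))
```

In this toy input a majority vote would call T in the third column, whereas the quality-weighted vote favors the single high-confidence G (30 against 10 + 10).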