January 12-16, 2002
Town & Country Convention Center
San Diego, CA
Poster: Genome Sequencing & ESTs
The chicken EST database (www.chickest.udel.edu) contains 5' end sequence data for more than 20,000 clones from a variety of cDNA libraries and is expected to expand to over 50,000 ESTs. To develop a unique collection of ESTs, the data must be partitioned into a non-redundant set of clusters. Unlike with 3' end sequencing, unique cDNAs are difficult to identify from 5' end sequence. In the absence of complete genome information or an apparent overlap, a more sophisticated gene-based or gene-name-based approach must be implemented to form EST sets containing sequence from the same gene. We combined sequence matching with a semantic/concept matching approach to clustering. After sequence clustering with Phrap, information is mined from the BLAST results of every contig and filtered for uninformative words. A summary [i.e., a set of (word, score) pairs] is built for each contig based on the quality of the output, the match scores, and their reference in the sequence database. These summaries are used to build a similarity function between the contigs. Pairs of contigs whose similarity exceeds a given threshold are treated as edges of a graph. New clusters are then extracted from this graph as connected components or by clique clustering. Results using the chicken database will be presented and evaluated for closely related genes. A Java script and user interface were developed to automate the process and annotate the EST clusters. They can also be used on EST projects from other species where genomic information is lacking.
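The graph step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the similarity function, so cosine similarity over the (word, score) summaries is assumed here, and the function names, the threshold value of 0.5, and the three example contig summaries are all hypothetical. Only the connected-components variant is shown, not clique clustering.

```python
from itertools import combinations
from math import sqrt


def cosine_similarity(a, b):
    """Cosine similarity between two {word: score} summaries (assumed measure)."""
    shared = set(a) & set(b)
    dot = sum(a[w] * b[w] for w in shared)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster_contigs(summaries, threshold=0.5):
    """Group contigs into connected components of the similarity graph.

    An edge joins two contigs when their summary similarity exceeds
    `threshold` (a hypothetical default, not a value from the abstract).
    """
    # Build an adjacency list from all pairs above the threshold.
    neighbors = {name: set() for name in summaries}
    for a, b in combinations(summaries, 2):
        if cosine_similarity(summaries[a], summaries[b]) > threshold:
            neighbors[a].add(b)
            neighbors[b].add(a)

    # Depth-first traversal to collect each connected component.
    seen = set()
    clusters = []
    for start in summaries:
        if start in seen:
            continue
        component = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbors[node] - component)
        seen |= component
        clusters.append(sorted(component))
    return clusters


# Hypothetical summaries: words mined from BLAST hits with illustrative scores.
summaries = {
    "contig1": {"myosin": 3.0, "heavy": 2.0, "chain": 2.0},
    "contig2": {"myosin": 2.5, "heavy": 1.5, "chain": 1.0, "cardiac": 1.0},
    "contig3": {"ribosomal": 3.0, "protein": 1.0, "l7": 2.0},
}

if __name__ == "__main__":
    print(cluster_contigs(summaries))
```

In this toy input, the two myosin-like contigs share most of their weighted words and fall into one cluster, while the ribosomal contig shares none and stays alone. Connected components merge contigs transitively, which can chain closely related genes together; clique clustering is stricter because every pair in a cluster must exceed the threshold.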