PAG-XIII  Plant & Animal Genomes XIII Conference

January 15-19, 2005
Town & Country Convention Center
San Diego, CA



P862 : Algorithms


Efficient Selection Of Unique And Popular Oligos For Large EST Databases

Jie Zheng1, Timothy J. Close2, Tao Jiang1, Stefano Lonardi1

1  Department of Computer Science and Engineering, University of California, Riverside, CA 92521, USA
2  Department of Botany and Plant Sciences, University of California, Riverside, CA 92521, USA

EST unigene databases have grown exponentially in recent years and now represent the largest collection of genetic sequences. An important application of these databases is the design of gene-specific oligonucleotides (or simply, oligos) for use in PCR primer design, microarray experiments, and genomic library screening. We study two complementary problems concerning the selection of short oligos, e.g., 20-50 bases, from a large database of tens of thousands of EST unigene sequences: (i) selection of oligos, each of which appears (exactly) in one unigene sequence but does not appear (exactly or approximately) in any other unigene sequence, and (ii) selection of oligos that appear (exactly or approximately) in many unigenes. The first problem has applications in PCR primer and microarray probe design; the second is useful in screening genomic libraries for gene-rich regions. We present an efficient algorithm to identify all unique oligos in the unigenes and an efficient heuristic algorithm to enumerate the popular oligos. The algorithms have been carefully engineered to run on regular PCs. Each algorithm takes hours (on a machine with a 1.2 GHz CPU and 3 GB of RAM) to run on a dataset of 37 Mbases of barley unigenes from the HarvEST database.
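
To make the two problem statements concrete, the following is a minimal brute-force sketch in Python. It is not the algorithm presented in this work: it considers exact matches only (the approximate-match conditions in (i) and (ii) are ignored), and it stores every oligo in memory, so it would likely not scale to a 37 Mbase dataset. The function names, the oligo length `k`, the threshold `min_unigenes`, and the toy sequences are all assumptions made for illustration.

```python
from collections import defaultdict


def oligo_occurrences(unigenes, k):
    """Map each length-k substring to the set of unigene indices containing it.

    Exact matching only; approximate matches are NOT considered here.
    """
    seen = defaultdict(set)
    for idx, seq in enumerate(unigenes):
        for i in range(len(seq) - k + 1):
            seen[seq[i:i + k]].add(idx)
    return seen


def unique_oligos(unigenes, k):
    """Oligos that occur (exactly) in one unigene only, mapped to its index."""
    occ = oligo_occurrences(unigenes, k)
    return {oligo: next(iter(ids)) for oligo, ids in occ.items() if len(ids) == 1}


def popular_oligos(unigenes, k, min_unigenes):
    """Oligos that occur (exactly) in at least min_unigenes distinct unigenes."""
    occ = oligo_occurrences(unigenes, k)
    return {oligo: len(ids) for oligo, ids in occ.items() if len(ids) >= min_unigenes}


# Hypothetical toy input; real oligos would be 20-50 bases long.
toy = ["ACGTACGA", "TTACGTGG", "GGACGTCC"]

unique = unique_oligos(toy, k=4)
popular = popular_oligos(toy, k=4, min_unigenes=3)
```

On the toy input, `ACGT` occurs in all three sequences and is therefore the only popular oligo at a threshold of 3, while `TACG` occurs in two sequences and is neither unique nor popular at that threshold. Extending the sketch to the full problem would require rejecting unique candidates that have a near match in another unigene and counting near matches toward popularity, which is where the brute-force approach becomes impractical.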