January 11-15, 2003
Town & Country Convention Center
San Diego, CA
Poster: Genome Sequencing & ESTs
The Arabidopsis thaliana and Oryza sativa genomes provide a clear functional and structural foundation for the first large-scale expeditions into comparative plant genomics. Although no other ‘model’ plant genomes have yet been completed, very large numbers of ESTs are being produced from a wide variety of plants. EST annotation and analysis systems such as Sputnik endeavour to assign meaning to these sequences in terms of homology, structure or function. This poster attempts to shed light on the content of large EST collections by correlating available EST and unigene data back to the parent genomes, and by investigating what doesn’t seem to make sense.
175,000 Arabidopsis thaliana ESTs and 110,000 Oryza sativa ESTs have been subjected to a state-of-the-art clustering and annotation pipeline. Close analysis of the unigene annotation reveals that EST libraries contain a mixture of the expected, the unexpected and the unexplainable. Screening against a purpose-built database of likely contaminant sequences shows that the EST collections are largely free of human, E. coli and phage contaminants. However, the EST collections contain a significant background of sequences that correspond to the genomic scaffold but are absent from the annotated gene list. These sequences may encode small peptides that have escaped annotation, may represent non-coding RNAs, or may stem from genomic DNA contamination. A small proportion of sequences (~7.5%) cannot be explained as contaminants and show no homology to the host genome. These sequences may represent genes from the remaining gaps in the genomes (the centromeres in Arabidopsis), or hitherto unidentified contaminants or pathogens. The development of an algorithm to explain host / non-host sequences is presented in the context of the analysis and annotation of plant-pathogen mixed cDNA libraries.
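The categories described above can be illustrated with a minimal sketch. It is not the algorithm presented on the poster: the class `EstHits`, the functions `classify` and `summarize`, the category labels, and the order in which the checks are applied are all assumptions made for illustration. The sketch also assumes the similarity searches against the contaminant database, the annotated gene list and the genomic scaffold have already been run and reduced to yes/no flags.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class EstHits:
    """Hypothetical per-EST summary of precomputed similarity searches."""
    est_id: str
    contaminant_hit: bool  # matches the contaminant database (human, E. coli, phage)
    gene_hit: bool         # matches an entry in the annotated gene list
    genome_hit: bool       # matches the genomic scaffold


def classify(hits: EstHits) -> str:
    """Assign one category per EST; the precedence order is an assumption."""
    if hits.contaminant_hit:
        return "contaminant"
    if hits.gene_hit:
        return "annotated_gene"
    if hits.genome_hit:
        return "unannotated_genomic"
    return "unexplained"


def summarize(all_hits: list[EstHits]) -> dict[str, float]:
    """Return the fraction of ESTs that fall into each category."""
    counts = Counter(classify(h) for h in all_hits)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {category: n / total for category, n in counts.items()}
```

Checking contaminants first means that a sequence matching both a contaminant and the host genome is reported as a contaminant; a real pipeline might instead compare the strength of the two matches.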