PAG-VI: ARABIan NIGHTS, or TALES TOLD BY 1001 PROTEIN SEQUENCE FROM ARABIDOPSIS GENOME

PAG-VI  Plant & Animal Genome VI Conference

Town & Country Hotel, San Diego, CA, January 18-22, 1998.


P3

ARABIan NIGHTS, or TALES TOLD BY 1001 PROTEIN SEQUENCE FROM ARABIDOPSIS GENOME

ARCADY R. MUSHEGIAN

    Sequana Therapeutics, Inc., 11099 N Torrey Pines Rd., La Jolla, CA 92037

The arabidopsis genome consortium has sequenced >5% of the genome and annotated 1500 proteins. The first 1001 proteins from the contigs submitted by three North American groups and one European group were reanalyzed using tools from Sequana's Genome Dragoman™ suite of strategies for high-throughput sequence analysis and function prediction. Bearing in mind that the sequence data are incomplete, we estimated how well protein sequences in arabidopsis are currently understood. Arabidopsis proteins are highly conserved in evolution, with >65% of predicted proteins matching entries in public databases from taxa other than higher plants. For >70%, a cellular role or a general biochemical function could be predicted based on sequence similarity. Half of the proteins belong to families of paralogs, and about one-third of those, or 15% of the whole protein set, belong to plant-specific families, even though some share conserved motifs with non-plant proteins. These ratios are emerging as invariants in proteome analysis, as they are consistent with similar numbers obtained for complete bacterial proteomes. They are unlikely to change as more plant sequence becomes available, although the repertoire of paralogous families will grow. Of interest are the results of a comparison with the completely sequenced genome of a blue-green alga, Synechocystis sp. Among the arabidopsis proteins that have a likely ortholog in bacteria, about 50% have a protein from Synechocystis as the best bacterial match. A fraction of these proteins have no established relation to chloroplast function and may represent displacements of general cellular functions by endosymbiont-derived genes. This fraction will be characterized in more detail. Similarity analysis suggested a small number of gene mispredictions, resulting in artifactual truncations or mergers of protein domains, as well as a relatively large number of misannotations. Several simple ways to significantly improve annotation quality will be suggested. High-quality functional annotation is essential for a better understanding of plant biochemistry and cellular signalling, as will be demonstrated using epicuticular wax biosynthesis and other pathways as examples.

