PAG-X  Plant, Animal & Microbe Genomes X Conference

January 12-16, 2002
Town & Country Convention Center
San Diego, CA


Workshop: Databases, Gene Systematics, and Nomenclature
            


CONSTRUCTION AND ANNOTATION OF ARABIDOPSIS GENE FAMILIES AT TIGR

Dongying Wu1, Daniel Haft1, Fan Yang1, Jeremy Peterson1, Brian Haas1, Rama Maiti1, Agnes Chan1, Linda Hannick1, Owen White1, Chris Town1

1 The Institute for Genomic Research, 9712 Medical Center Drive, Rockville MD, USA.

The sequence of the Arabidopsis thaliana genome reveals the presence of approximately 25,500 genes. In the publication describing the entire genome (Nature, 2000), the predicted proteins were clustered into families based upon BLASTP matches having E values of < 1e-20 and extending over at least 80% of the protein length. The resulting analysis identified a total of 11,601 protein types, including 2,677 paralogous families. These criteria for building protein families were more stringent than those that we had used during annotation and, by treating each protein sequence as a single entity, ignored the domain structure of many proteins that can provide insight into family relationships. As an alternative, we developed a domain-based approach that first identified all possible domains in the entire proteome and then clustered proteins into paralogous groups based upon shared domains. Pfam HMM profiles (version 5.4) were searched against the Arabidopsis proteome, and the hits were combined with the original seed profiles to produce new HMM profiles, which were then searched against the entire proteome again. A total of 929 Pfam domains had at least one hit in the Arabidopsis genome; of these, 536 profiles have been iterated. A total of 13,730 out of 26,199 Arabidopsis proteins in the then-current dataset had at least one Pfam hit. In order to extend the Pfam profiles, we removed from the Arabidopsis proteome all amino acid stretches containing Pfam domains. The remaining peptides (34,500) were searched all vs. all, clustered into homology-based groups using a link-score parameter, and aligned to produce 4,484 domain alignments. The 929 Pfam profiles and 4,484 novel domain alignments were then used to search the entire Arabidopsis proteome, and all the resulting domain matches and their alignments were captured and stored in a relational database. Paralogous families were derived by requiring that proteins have all domains in common and at least 40% of their sequence covered by domains.
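The family-building rule stated above (all domains in common, at least 40% of the sequence covered by domains) can be illustrated with a minimal Python sketch. It is not the TIGR pipeline: the data layout, the function names, the toy proteins, and the reading of "all domains in common" as "identical sets of domain identifiers" are all assumptions made for illustration.

```python
from collections import defaultdict

MIN_COVERAGE = 0.40  # from the abstract: at least 40% of the sequence covered by domains


def domain_coverage(length, hits):
    """Fraction of residues covered by at least one domain hit.

    hits: list of (domain_id, start, end), 1-based inclusive (assumed layout).
    Overlapping hits are counted once.
    """
    covered = 0
    last_end = 0
    for _, start, end in sorted(hits, key=lambda h: h[1]):
        start = max(start, last_end + 1)  # skip residues already counted
        if end >= start:
            covered += end - start + 1
            last_end = end
    return covered / length


def build_families(proteins):
    """proteins: dict protein_id -> (length, hits).

    Returns (families, singletons). Proteins are grouped when their sets of
    domain identifiers are identical (an assumed reading of the rule).
    """
    groups = defaultdict(list)
    singletons = []
    for pid, (length, hits) in proteins.items():
        if not hits or domain_coverage(length, hits) < MIN_COVERAGE:
            singletons.append(pid)
            continue
        groups[frozenset(d for d, _, _ in hits)].append(pid)

    families = []
    for members in groups.values():
        if len(members) >= 2:
            families.append(sorted(members))
        else:
            singletons.extend(members)
    return sorted(families), sorted(singletons)


# Hypothetical toy data, not taken from the Arabidopsis dataset.
toy_proteins = {
    "P1": (200, [("DOM_A", 1, 60), ("DOM_B", 80, 150)]),
    "P2": (300, [("DOM_A", 10, 90), ("DOM_B", 120, 220)]),
    "P3": (400, [("DOM_A", 1, 50)]),   # only 12.5% covered
    "P4": (100, [("DOM_C", 1, 90)]),   # well covered, but no partner
}
families, singletons = build_families(toy_proteins)
print(families)
print(singletons)
```

In this toy case P1 and P2 share both domains and pass the coverage threshold, so they form one family; P3 fails the coverage test and P4 has no partner, so both remain singletons.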
This approach produced a total of 4,070 paralogous families of two or more members containing a total of 17,684 proteins, leaving 8,515 singletons. The paralogous families constructed by this approach are being used as part of our ongoing work to re-evaluate, update and standardize the annotation of the entire Arabidopsis genome. The domain-based information allows one to readily visualize the relationships between members of paralogous families. It is also possible to construct different sets of families on the fly, based upon user-specified degrees of domain sharing. This project was supported by the National Science Foundation.
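One way to picture families built on the fly at a user-specified degree of domain sharing is sketched below in Python. The abstract does not say how the degree of sharing is measured or how proteins are grouped, so the Jaccard index, the single-linkage grouping, and every name and toy value here are assumptions for illustration only.

```python
from itertools import combinations


def jaccard(a, b):
    """Shared domains divided by all distinct domains in the two proteins."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def families_by_sharing(domains, min_sharing):
    """domains: dict protein_id -> set of domain ids (assumed layout).

    Links every pair of proteins whose Jaccard index is >= min_sharing and
    returns the connected components that have two or more members.
    The all-pairs loop is quadratic, which is fine for a small sketch.
    """
    parent = {p: p for p in domains}

    def find(p):
        # Union-find lookup with path halving.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for p, q in combinations(domains, 2):
        if jaccard(domains[p], domains[q]) >= min_sharing:
            parent[find(p)] = find(q)

    clusters = {}
    for p in domains:
        clusters.setdefault(find(p), []).append(p)
    return sorted(sorted(c) for c in clusters.values() if len(c) >= 2)


# Hypothetical toy data, not taken from the Arabidopsis dataset.
toy_domains = {
    "P1": {"DOM_A", "DOM_B"},
    "P2": {"DOM_A", "DOM_B"},
    "P3": {"DOM_A"},
    "P4": {"DOM_C"},
}
strict = families_by_sharing(toy_domains, 1.0)   # identical domain sets only
relaxed = families_by_sharing(toy_domains, 0.5)  # half of the domains shared
print(strict)
print(relaxed)
```

With a threshold of 1.0 only P1 and P2 are grouped; lowering it to 0.5 also pulls in P3, which shares one of the two domains, while P4 stays outside any family at either setting.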

