Computational Biology Centers, Academic Health Center, University of Minnesota, Box 43 Mayo, 420 Delaware St SE, Minneapolis, MN 55455 USA
Current methods for clustering protein sequences into families are valuable tools for identifying function. Nearly a dozen such classifications are in widespread use. Each was created with a different objective in mind and can be used to characterize unknown sequences in subtly different ways. We discuss preliminary results of a set-theoretic comparison and practical-utility study of several publicly accessible databases (PIR1, DOMO, ProDom, Pfam, PROSITE, SBASE, BLOCKS, PRINTS, SYSTERS, and PROTOMAP). In comparing such a diverse set, we focus on overlap and consensus, superset and subset relationships, and strengths and weaknesses for various research objectives. We do not attempt to name 'winners' and 'losers', given the varied purposes for which each database was created.
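
As an illustration of what a set-theoretic comparison of two classifications could look like, the following minimal sketch treats each family as a set of sequence identifiers and reports, for every pair of families that share at least one member, whether one is a subset or superset of the other or whether they merely overlap. It is not the procedure used in this study. The function names (jaccard, relation, compare), the database contents, and the sequence identifiers are all hypothetical and chosen only for the example.

```python
from typing import Dict, FrozenSet, List, Tuple

# A family is modeled as a set of sequence identifiers (an assumption made
# for this sketch; real databases define families over domains, motifs, or
# whole sequences, which would need to be mapped to a common identifier).
Family = FrozenSet[str]
Database = Dict[str, Family]


def jaccard(a: Family, b: Family) -> float:
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def relation(a: Family, b: Family) -> str:
    """Classify how family a relates to family b."""
    if a == b:
        return "identical"
    if a < b:
        return "subset"
    if a > b:
        return "superset"
    if a & b:
        return "overlap"
    return "disjoint"


def compare(db_a: Database, db_b: Database) -> List[Tuple[str, str, str, float]]:
    """List every pair of families that share at least one sequence."""
    rows = []
    for name_a, fam_a in db_a.items():
        for name_b, fam_b in db_b.items():
            rel = relation(fam_a, fam_b)
            if rel != "disjoint":
                rows.append((name_a, name_b, rel, jaccard(fam_a, fam_b)))
    return rows


# Hypothetical toy data: two classifications of six sequences.
toy_a: Database = {
    "A1": frozenset({"s1", "s2"}),
    "A2": frozenset({"s4", "s5"}),
}
toy_b: Database = {
    "B1": frozenset({"s1", "s2", "s3"}),
    "B2": frozenset({"s5", "s6"}),
}

if __name__ == "__main__":
    for row in compare(toy_a, toy_b):
        print(row)
```

The all-pairs loop is quadratic in the number of families, which is adequate for a toy example. A comparison at database scale would more plausibly index families by sequence identifier first, so that only families sharing a member are ever compared.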