Molecular Statistics and Bioinformatics Section, National Cancer Institute, Bethesda, MD 20892-9015, USA
Databases containing the results of many thousands of gene expression array experiments are available and growing rapidly. As these databases grow, an appropriate similarity metric becomes increasingly important for searching, as well as for clustering and other applications of the data. Previous work has treated each set of simultaneously measured expression levels (or ratios) as a vector, and used either Euclidean distance or a correlational measure such as the vector dot product to define similarity. However, the data from these experiments have several features that work against these similarity metrics. First, due to shared promoters and other regulatory controls, gene expression levels are highly correlated and have a complex covariance structure. Similarity measures that assume independence among the genes will make significant errors in this context. Second, the observed distributions of expression levels for individual genes are also complex. Thousands of genes appear to exhibit significant multimodality, and even the unimodally distributed genes rarely appear to be Gaussian. These distributional irregularities mean that a particular difference in expression level (or ratio) may have a different meaning in different parts of the expression range. We propose a Bayesian similarity measure over experiments that addresses these issues by using the database itself to estimate the odds that two experiments were samples from one versus two different underlying distributions. In addition to the theoretical measure, we present computationally tractable approximations to it.
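To make the "one versus two underlying distributions" idea concrete, the following is a minimal single-gene sketch in Python. It is not the measure proposed here, whose form the abstract does not give. It assumes, purely for illustration, that each observed value is a latent expression level drawn from the database's empirical distribution plus Gaussian measurement noise of known standard deviation. The function name `log_odds_same_source`, the parameter `noise_sd`, and the simulated bimodal database are all hypothetical.

```python
import numpy as np


def log_odds_same_source(x, y, database, noise_sd):
    """Toy log odds that x and y share one latent level (illustrative assumption).

    H_same: x and y are two noisy readings of a single latent level z,
            with z drawn from the empirical database distribution.
    H_diff: x and y are noisy readings of two independent latent levels.
    """
    z = np.asarray(database, dtype=float)

    def log_normal_pdf(v):
        # Log density of N(v | z, noise_sd^2) for every database value z.
        return (-0.5 * ((v - z) / noise_sd) ** 2
                - np.log(noise_sd * np.sqrt(2.0 * np.pi)))

    lx, ly = log_normal_pdf(x), log_normal_pdf(y)
    log_n = np.log(z.size)

    # Average over database values, computed stably in log space.
    log_joint = np.logaddexp.reduce(lx + ly) - log_n  # p(x, y | H_same)
    log_px = np.logaddexp.reduce(lx) - log_n          # p(x)
    log_py = np.logaddexp.reduce(ly) - log_n          # p(y)
    return log_joint - (log_px + log_py)


# Simulated bimodal "database" for one gene (made-up data, not real results).
rng = np.random.default_rng(0)
database = np.concatenate([rng.normal(-2.0, 0.3, 500),
                           rng.normal(2.0, 0.3, 500)])

same_mode = log_odds_same_source(-2.0, -2.0, database, noise_sd=0.3)
cross_mode = log_odds_same_source(-2.0, 2.0, database, noise_sd=0.3)
print(same_mode, cross_mode)
```

Because the reference distribution comes from the database rather than from a Gaussian assumption, the score depends on where in the expression range the two values fall, not only on their difference. The sketch treats one gene in isolation, so it does nothing about the correlation among genes described above.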