A Probabilistic Generative Model for GO Enrichment Analysis

Abstract:
Motivation: The Gene Ontology (GO) is extensively used to analyze all types of high-throughput experiments. However, researchers still face several challenges when using GO and other functional annotation databases. One problem is the large number of hypotheses tested in each study. In addition, categories often overlap with both direct parents/descendants and other distant categories in the hierarchical structure. This makes it hard to determine whether the identified significant categories represent different functional outcomes or rather a redundant view of the same biological processes.

Results: To overcome these problems, we developed a generative probabilistic model that identifies a (small) subset of categories that, together, explain the selected gene set. Our model accommodates noise and errors in both the selected gene set and GO. Using controlled GO data, our method correctly recovered most of the selected categories, leading to dramatic improvements over current methods for GO analysis. When used with microarray expression data and ChIP-chip data from yeast and human, our method correctly identified both general and specific enriched categories that were overlooked by other methods.