Data Mining for Epistatic Genetic Interactions using “Random Chemistry”

 

 

Dr. Margaret J. Eppstein

Department of Computer Science

University of Vermont

 

Date: Monday, September 25, 2006

Time: 12:20 p.m. - 1:10 p.m.

Location: 367 Votey

 

 

Abstract

 

There are estimated to be on the order of 106 single nucleotide polymorphisms (SNPs) existing as standing variation in the human genome.  Certain combinations of these SNPs can interact in complex ways to predispose individuals for a variety of common diseases, even though individual SNPs may have no ill effects.  Detecting these epistatic combinations is a computationally daunting task.  Trying to use individual or growing subsets of SNPs as building blocks for detection of larger combinations of purely epistatic SNPs (e.g., via genetic algorithms or genetic programming) is no better than random search, since there is no predictive power in subsets of the correct set of epistatically interacting SNPs.  Here, we explore the potential for hill-climbing from the other direction; that is, from large sets of candidate SNPs to smaller ones.  This approach was inspired by Kauffman’s “random chemistry” approach to detecting small auto-catalytic sets of molecules from within large sets.  Although the algorithm is conceptually straightforward, its success hinges upon the creation of a fitness function able to discriminate large sets that contain subsets of interacting SNPs from those that don’t.  Here, we employ an approximate and noisy fitness function based on the ReliefF data mining algorithm. Preliminary results from synthetic data sets show that the resulting algorithm can detect epistatic pairs from up to 1000 candidate SNPs in O(log N) fitness evaluations, although success rate degrades as heritability declines. The results presented herein are offered as proof of concept for the random chemistry approach, and suggestions are proposed for future improvements that will extend the approach to larger data sets and to lower heritabilities.