Selecting a minimal feature set that is maximally informative about a target variable is a central task in machine learning and statistics. Information theory provides a powerful framework for formulating feature selection algorithms, yet a rigorous, information-theoretic definition of feature relevancy that accounts for feature interactions such as redundant and synergistic contributions is still missing. We argue that this lack is inherent to classical information theory, which does not provide measures to decompose the information a set of variables provides about a target into unique, redundant, and synergistic contributions. Such a decomposition has been introduced only recently by the partial information decomposition (PID) framework. Using PID, we clarify why feature selection is a conceptually difficult problem when approached using information theory, and we provide a novel definition of feature relevancy and redundancy in PID terms. From this definition, we show that the conditional mutual information (CMI) maximizes relevancy while minimizing redundancy, and we propose an iterative, CMI-based algorithm for practical feature selection. We demonstrate the power of our CMI-based algorithm in comparison to the unconditional mutual information on benchmark examples and provide corresponding PID estimates to highlight how PID allows quantifying the information contributions of features and their interactions in feature-selection problems.
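To make the iterative selection scheme concrete, the following minimal Python sketch illustrates greedy forward selection by conditional mutual information for discrete data: at each step the feature with the largest CMI with the target, conditioned on the already selected features, is added. The plug-in entropy estimator, the toy data, the stopping threshold, and all function names are illustrative assumptions and do not reproduce the authors' implementation or estimator.

    # Greedy, CMI-based forward feature selection for discrete data (sketch).
    # All names and the stopping threshold are illustrative assumptions.
    import numpy as np

    def entropy(*columns):
        """Plug-in entropy H(columns) in bits, from joint counts of discrete columns."""
        joint = np.stack(columns, axis=1)
        _, counts = np.unique(joint, axis=0, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    def conditional_mutual_information(x, y, z_columns):
        """I(X; Y | Z) = H(X, Z) + H(Y, Z) - H(Z) - H(X, Y, Z); Z may be empty."""
        if not z_columns:
            return entropy(x) + entropy(y) - entropy(x, y)
        return (entropy(x, *z_columns) + entropy(y, *z_columns)
                - entropy(*z_columns) - entropy(x, y, *z_columns))

    def greedy_cmi_selection(X, y, max_features=3, min_gain=1e-3):
        """Iteratively add the feature with the largest CMI with the target,
        conditioned on the features selected so far."""
        selected = []
        while len(selected) < max_features:
            remaining = [j for j in range(X.shape[1]) if j not in selected]
            gains = {j: conditional_mutual_information(
                         X[:, j], y, [X[:, k] for k in selected])
                     for j in remaining}
            best = max(gains, key=gains.get)
            if gains[best] < min_gain:  # stop when no candidate adds information
                break
            selected.append(best)
        return selected

    # Toy example: the target is the sum of features 0 and 1; feature 2 is a
    # redundant copy of feature 0, feature 3 is pure noise.
    rng = np.random.default_rng(0)
    x0 = rng.integers(0, 2, 5000)
    x1 = rng.integers(0, 2, 5000)
    X = np.stack([x0, x1, x0.copy(), rng.integers(0, 2, 5000)], axis=1)
    y = x0 + x1
    print(greedy_cmi_selection(X, y))  # expected: [0, 1]

In this toy setting the conditioning is what excludes the redundant copy: once feature 0 is selected, feature 2 contributes no additional conditional information, whereas ranking by unconditional mutual information alone would score it as highly as feature 0.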