Not all real-world data are labeled, and when labels are unavailable, obtaining them is often costly. Moreover, because many algorithms suffer from the curse of dimensionality, reducing the features in the data to a smaller set is often of great utility. Unsupervised feature selection aims to reduce the number of features, often using feature importance scores to quantify the relevance of individual features to the task at hand. In the unsupervised setting, these scores can rely only on the distribution of the variables and on a quantification of their interactions. The previous literature, which mainly investigates anomaly detection and clustering, fails to address the elimination of redundant features. We propose evaluating the correlations among features to compute feature importance scores that represent the contribution of individual features to explaining the dataset's structure. Based on coalitional game theory, our feature importance scores include a notion of redundancy awareness, which makes them a tool for achieving redundancy-free feature selection. We show that the resulting feature selection outperforms competing methods in lowering the redundancy rate while maximizing the information contained in the data. We also introduce an approximate version of the algorithm to reduce the complexity of computing Shapley values.
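To make the idea of correlation-based Shapley scores concrete, the following is a minimal sketch and not the authors' implementation. It treats features as players in a coalitional game and computes exact Shapley values by enumerating coalitions. The value function is an assumption chosen for illustration: the total correlation of a feature subset under a Gaussian approximation, computed as minus one half of the log-determinant of the subset's correlation matrix. All function names (`pearson`, `correlation_matrix`, `log_det`, `total_correlation`, `shapley_scores`) and the toy data are hypothetical.

```python
import math
import random
from itertools import combinations


def pearson(x, y):
    """Sample Pearson correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(columns):
    """Pairwise correlation matrix for a list of feature columns."""
    n = len(columns)
    return [[1.0 if i == j else pearson(columns[i], columns[j])
             for j in range(n)] for i in range(n)]


def log_det(matrix):
    """Log-determinant of a symmetric positive-definite matrix (Cholesky)."""
    n = len(matrix)
    low = [[0.0] * n for _ in range(n)]
    for r in range(n):
        for c in range(r + 1):
            s = sum(low[r][k] * low[c][k] for k in range(c))
            if r == c:
                low[r][c] = math.sqrt(matrix[r][r] - s)
            else:
                low[r][c] = (matrix[r][c] - s) / low[c][c]
    return 2.0 * sum(math.log(low[k][k]) for k in range(n))


def total_correlation(corr, subset):
    """Assumed value function: Gaussian total correlation of a feature subset."""
    if len(subset) < 2:
        return 0.0  # empty and single-feature coalitions carry no shared structure
    sub = [[corr[i][j] for j in subset] for i in subset]
    return -0.5 * log_det(sub)


def shapley_scores(corr):
    """Exact Shapley value of each feature; cost grows exponentially with n."""
    n = len(corr)
    scores = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = (math.factorial(size) * math.factorial(n - size - 1)
                      / math.factorial(n))
            for coalition in combinations(others, size):
                gain = (total_correlation(corr, coalition + (i,))
                        - total_correlation(corr, coalition))
                scores[i] += weight * gain
    return scores


# Toy data: features 0 and 1 are nearly redundant, feature 2 is independent noise.
rng = random.Random(0)
a = [rng.gauss(0, 1) for _ in range(500)]
b = [x + 0.1 * rng.gauss(0, 1) for x in a]
c = [rng.gauss(0, 1) for _ in range(500)]

corr = correlation_matrix([a, b, c])
scores = shapley_scores(corr)
print(scores)
```

In this toy setting the two strongly correlated features share most of the total score, while the independent feature receives a score close to zero, and the scores sum to the value of the full feature set. The exact enumeration visits every coalition of the remaining features for each feature, so its cost grows exponentially with the number of features, which is the kind of cost an approximate variant is meant to reduce.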