Real-world datasets are often high-dimensional and affected by the curse of dimensionality, which hinders their comprehensibility and interpretability. To reduce this complexity, feature selection aims to identify the features that are crucial for learning from the data. While measures of relevance and pairwise similarity are commonly used, the curse of dimensionality is rarely incorporated into the feature selection process itself. Here we step in with a novel method that identifies the features that allow one to discriminate data subsets of different sizes. By adapting recent work on computing intrinsic dimensionality, our method selects the features that can discriminate the data and thus weaken the curse of dimensionality. Our experiments show that our method is competitive and commonly outperforms established feature selection methods. Furthermore, we propose an approximation that allows our method to scale to datasets consisting of millions of data points. Our findings suggest that features that discriminate the data, and that are connected to a low intrinsic dimensionality, are meaningful for learning procedures.