Selecting Features by their Resilience to the Curse of Dimensionality

Real-world datasets are often high-dimensional and affected by the curse of dimensionality, which hinders their comprehensibility and interpretability. To reduce this complexity, feature selection aims to identify the features that are crucial for learning from the data. While measures of relevance and pairwise similarity are commonly used, the curse of dimensionality is rarely incorporated into the feature selection process. We address this gap with a novel method that identifies the features that make it possible to discriminate data subsets of different sizes. By adapting recent work on computing intrinsic dimensionalities, our method selects the features that can discriminate data and thus weaken the curse of dimensionality. Our experiments show that the method is competitive with, and commonly outperforms, established feature selection methods. Furthermore, we propose an approximation that allows the method to scale to datasets consisting of millions of data points. Our findings suggest that features that discriminate data and are connected to a low intrinsic dimensionality are meaningful for learning procedures.
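To make the curse of dimensionality mentioned above concrete, the following is a minimal sketch of one of its well-known symptoms, distance concentration: as the number of dimensions grows, the nearest and farthest neighbors of a point become almost equally far away. It is not the feature selection method described in this abstract. The function name `relative_contrast`, the uniform unit-cube data, and the chosen sample sizes are all assumptions made for illustration.

```python
import math
import random


def relative_contrast(n_points, dim, seed=0):
    """Return (d_max - d_min) / d_min for the Euclidean distances from one
    random query point to n_points random points, all drawn uniformly
    from the unit cube [0, 1]^dim. (Hypothetical helper for illustration.)"""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    d_min, d_max = min(dists), max(dists)
    return (d_max - d_min) / d_min


if __name__ == "__main__":
    # The contrast shrinks as the dimension grows: distances "concentrate".
    for dim in (2, 10, 100, 1000):
        print(dim, round(relative_contrast(200, dim), 3))
```

A small relative contrast means that distances carry little information for telling data points apart, which is one reason why selecting features that still discriminate the data can be useful.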


