Feature selection often improves model interpretability, computational speed, and predictive performance by discarding irrelevant or redundant features. While feature selection is a well-studied problem with many widely-used techniques, two key challenges persist: i) many existing approaches become computationally intractable in huge-data settings with millions of observations and features; and ii) the statistical accuracy of selected features degrades in high-noise, high-correlation settings, hindering reliable model interpretation. We tackle these problems by proposing Stable Minipatch Selection (STAMPS) and Adaptive Stable Minipatch Selection (AdaSTAMPS). These are meta-algorithms that build ensembles of selection events of base feature selectors trained on many tiny, possibly adaptively-chosen, random subsets of both the observations and features of the data, which we call minipatches. Our approaches are general and can be employed with a variety of existing feature selection strategies and machine learning techniques. In addition, we provide theoretical insights on STAMPS and empirically demonstrate that our approaches, especially AdaSTAMPS, dominate competing methods in terms of feature selection accuracy and computational time.
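To make the minipatch ensemble idea concrete, below is a minimal Python sketch of the STAMPS scheme as described above: repeatedly draw tiny random subsets of observations and features, run a base selector on each minipatch, and keep the features whose selection frequency clears a stability threshold. The base selector (a lasso with a fixed penalty), the minipatch sizes, and the 0.5 threshold are illustrative assumptions, not the authors' exact configuration.

```python
import numpy as np
from sklearn.linear_model import Lasso

def stamps(X, y, n_minipatches=1000, n_obs=None, n_feats=None,
           threshold=0.5, seed=0):
    """Minimal sketch of Stable Minipatch Selection (STAMPS).

    Draws many tiny random subsets of observations and features
    ("minipatches"), fits a base feature selector on each, and returns
    the features selected most often. All defaults here are assumed
    for illustration.
    """
    rng = np.random.default_rng(seed)
    N, M = X.shape
    n_obs = n_obs or max(10, int(0.05 * N))      # minipatch height (assumed)
    n_feats = n_feats or max(5, int(0.05 * M))   # minipatch width (assumed)

    select_counts = np.zeros(M)   # times each feature was selected
    appear_counts = np.zeros(M)   # times each feature appeared in a minipatch

    for _ in range(n_minipatches):
        rows = rng.choice(N, size=n_obs, replace=False)
        cols = rng.choice(M, size=n_feats, replace=False)
        # Base selector on this minipatch: lasso with a fixed penalty
        # (an assumed stand-in for any sparse feature selector).
        model = Lasso(alpha=0.1).fit(X[np.ix_(rows, cols)], y[rows])
        appear_counts[cols] += 1
        select_counts[cols[model.coef_ != 0]] += 1

    # Selection frequency: fraction of appearances in which each
    # feature was selected by the base selector.
    freq = np.divide(select_counts, appear_counts,
                     out=np.zeros(M), where=appear_counts > 0)
    return np.flatnonzero(freq >= threshold), freq
```

In this sketch the minipatches are drawn uniformly at random; the adaptive variant (AdaSTAMPS) would instead bias the feature draw toward promising features using information from earlier minipatches, which this sketch omits.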