Feature selection often improves model interpretability, computational speed, and predictive performance by discarding irrelevant or redundant features. While feature selection is a well-studied problem with many widely-used techniques, two key challenges persist: i) many existing approaches become computationally intractable in huge-data settings with millions of observations and features; and ii) the statistical accuracy of selected features degrades in high-noise, high-correlation settings, hindering reliable model interpretation. We tackle these problems by proposing Stable Minipatch Selection (STAMPS) and Adaptive Stable Minipatch Selection (AdaSTAMPS). These are meta-algorithms that build ensembles of selection events of base feature selectors trained on many tiny, possibly adaptively-chosen, random subsets of both the observations and features of the data, which we call minipatches. Our approaches are general and can be employed with a variety of existing feature selection strategies and machine learning techniques. In addition, we provide theoretical insights on STAMPS and empirically demonstrate that our approaches, especially AdaSTAMPS, dominate competing methods in terms of feature selection accuracy and computational time.
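To make the minipatch ensemble idea concrete, below is a minimal Python sketch of the STAMPS scheme as described above: repeatedly draw tiny random subsets of observations and features, run a base selector on each minipatch, and keep the features whose selection frequency clears a stability threshold. The base selector (a lasso with a fixed penalty), the minipatch sizes, and the 0.5 threshold are illustrative assumptions, not the authors' exact configuration.

```python
import numpy as np
from sklearn.linear_model import Lasso

def stamps(X, y, n_minipatches=1000, n_obs=None, n_feats=None,
           threshold=0.5, seed=0):
    """Minimal sketch of Stable Minipatch Selection (STAMPS).

    Draws many tiny random subsets of observations and features
    ("minipatches"), fits a base feature selector on each, and returns
    the features selected most often. All defaults here are assumed
    for illustration.
    """
    rng = np.random.default_rng(seed)
    N, M = X.shape
    n_obs = n_obs or max(10, int(0.05 * N))      # minipatch height (assumed)
    n_feats = n_feats or max(5, int(0.05 * M))   # minipatch width (assumed)

    select_counts = np.zeros(M)   # times each feature was selected
    appear_counts = np.zeros(M)   # times each feature appeared in a minipatch

    for _ in range(n_minipatches):
        rows = rng.choice(N, size=n_obs, replace=False)
        cols = rng.choice(M, size=n_feats, replace=False)
        # Base selector on this minipatch: lasso with a fixed penalty
        # (an assumed stand-in for any sparse feature selector).
        model = Lasso(alpha=0.1).fit(X[np.ix_(rows, cols)], y[rows])
        appear_counts[cols] += 1
        select_counts[cols[model.coef_ != 0]] += 1

    # Selection frequency: fraction of appearances in which each
    # feature was selected by the base selector.
    freq = np.divide(select_counts, appear_counts,
                     out=np.zeros(M), where=appear_counts > 0)
    return np.flatnonzero(freq >= threshold), freq
```

In this sketch the minipatches are drawn uniformly at random; the adaptive variant (AdaSTAMPS) would instead bias the feature draw toward promising features using information from earlier minipatches, which this sketch omits.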