Controlling Costs: Feature Selection on a Budget

The traditional framework for feature selection treats all features as costing the same amount. However, in reality, a scientist often has considerable discretion regarding which variables to measure, and the decision involves a tradeoff between model accuracy and cost (where cost can refer to money, time, difficulty, or intrusiveness). In particular, unnecessarily including an expensive feature in a model is worse than unnecessarily including a cheap feature. We propose a procedure, which we call cheap knockoffs, for performing feature selection in a cost-conscious manner. The key idea behind our method is to force higher cost features to compete with more knockoffs than cheaper features. We derive an upper bound on the weighted false discovery proportion associated with this procedure, which corresponds to the fraction of the feature cost that is wasted on unimportant features. We prove that this bound holds simultaneously with high probability over a path of selected variable sets of increasing size. A user may thus select a set of features based, for example, on the overall budget, while knowing that no more than a particular fraction of feature cost is wasted. We investigate, through simulation and a biomedical application, the practical importance of incorporating cost considerations into the feature selection process.
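To make the cost-weighted error measure concrete, here is a minimal sketch of the weighted false discovery proportion as the abstract describes it: the fraction of the selected features' total cost that is spent on unimportant features. It does not implement the cheap knockoffs procedure itself. The function names, the example costs, and the `important` set are illustrative assumptions; in practice the truly important features are unknown, which is why the abstract refers to an upper bound rather than an exact value.

```python
from typing import List, Sequence, Set


def weighted_fdp(selected: Sequence[int], costs: Sequence[float],
                 important: Set[int]) -> float:
    """Fraction of the selected features' cost spent on unimportant features."""
    total = sum(costs[j] for j in selected)
    if total == 0:
        return 0.0  # convention assumed here: nothing selected, nothing wasted
    wasted = sum(costs[j] for j in selected if j not in important)
    return wasted / total


def fdp_along_path(order: Sequence[int], costs: Sequence[float],
                   important: Set[int]) -> List[float]:
    """Weighted FDP of each nested set order[:1], order[:2], ... on a path."""
    return [weighted_fdp(order[:k], costs, important)
            for k in range(1, len(order) + 1)]


def largest_set_within_budget(order: Sequence[int], costs: Sequence[float],
                              budget: float) -> List[int]:
    """Longest prefix of the path whose total cost does not exceed the budget."""
    chosen: List[int] = []
    spent = 0.0
    for j in order:
        if spent + costs[j] > budget:
            break
        chosen.append(j)
        spent += costs[j]
    return chosen


# Hypothetical example: four features, of which features 0 and 1 matter.
costs = [1.0, 4.0, 2.0, 3.0]
important = {0, 1}
order = [1, 0, 3, 2]  # assumed order in which features enter the path

path = fdp_along_path(order, costs, important)
within_budget = largest_set_within_budget(order, costs, budget=8.0)
print(path)
print(within_budget)
```

In this toy example, an unweighted false discovery proportion would treat the two unimportant features identically, while the weighted version penalizes the more expensive one more heavily.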

Related content

Feature selection

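Feature selection, also called feature subset selection (FSS) or attribute selection, means choosing N features from an existing set of M features so that a specified system metric is optimized. It is the process of selecting the most effective features from the original set in order to reduce the dimensionality of a dataset. It is an important means of improving the performance of learning algorithms and a key data preprocessing step in pattern recognition. For a learning algorithm, good training samples are the key to training a model.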
