The traditional framework for feature selection treats all features as costing the same amount. However, in reality, a scientist often has considerable discretion regarding which variables to measure, and the decision involves a tradeoff between model accuracy and cost (where cost can refer to money, time, difficulty, or intrusiveness). In particular, unnecessarily including an expensive feature in a model is worse than unnecessarily including a cheap feature. We propose a procedure, which we call cheap knockoffs, for performing feature selection in a cost-conscious manner. The key idea behind our method is to force higher cost features to compete with more knockoffs than cheaper features. We derive an upper bound on the weighted false discovery proportion associated with this procedure, which corresponds to the fraction of the feature cost that is wasted on unimportant features. We prove that this bound holds simultaneously with high probability over a path of selected variable sets of increasing size. A user may thus select a set of features based, for example, on the overall budget, while knowing that no more than a particular fraction of feature cost is wasted. We investigate, through simulation and a biomedical application, the practical importance of incorporating cost considerations into the feature selection process.
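As a concrete reading of the weighted false discovery proportion described above (the fraction of selected feature cost spent on unimportant features), the following is a minimal sketch. It is not the cheap knockoffs procedure itself. The function name `weighted_fdp`, the feature names, and the costs are hypothetical, and returning 0 for an empty selection is an assumed convention.

```python
def weighted_fdp(selected, costs, important):
    """Fraction of the total cost of the selected features that is spent
    on unimportant ones (illustrative definition, not the authors' code)."""
    total_cost = sum(costs[j] for j in selected)
    if total_cost == 0:
        # Assumed convention: nothing selected means nothing wasted.
        return 0.0
    wasted_cost = sum(costs[j] for j in selected if j not in important)
    return wasted_cost / total_cost


# Hypothetical feature costs (arbitrary units) and ground truth.
costs = {"age": 1.0, "blood_test": 5.0, "mri": 20.0, "survey": 2.0}
important = {"age", "blood_test"}

cheap_mistake = ["age", "blood_test", "survey"]  # one cheap null feature
costly_mistake = ["age", "blood_test", "mri"]    # one expensive null feature

print(weighted_fdp(cheap_mistake, costs, important))
print(weighted_fdp(costly_mistake, costs, important))
```

Both selections contain one unimportant feature out of three, so their unweighted false discovery proportion is the same. The weighted version penalizes the second selection far more heavily, which reflects the point that unnecessarily including an expensive feature is worse than unnecessarily including a cheap one.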