Many applications require collecting data on different variables or measurements across many system performance metrics. We refer to these broadly as measures or variables. Collecting data for each measure often incurs a cost, so it is desirable to account for the cost of measures when building a model. This is a fairly new class of problems in cost-sensitive learning. A few attempts have been made to incorporate costs when combining and selecting measures. However, existing approaches either do not strictly enforce a budget constraint or are not the 'most' cost-effective. Focusing on classification problems, we propose a computationally efficient approach that can find a near-optimal model under a given budget by exploring the most 'promising' part of the solution space. Instead of outputting a single model, we produce a model schedule: a list of models sorted by model cost and expected predictive accuracy. The schedule can be used to choose the model with the best predictive accuracy under a given budget, or to trade off budget against predictive accuracy. Experiments on benchmark datasets show that our approach compares favorably to competing methods.
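
To make the idea of a model schedule concrete, the following is a minimal sketch of how such a list might be stored and queried for the best model under a budget. It illustrates only the lookup step, not the proposed search procedure. The names `ScheduledModel`, `build_schedule`, and `best_under_budget`, the example costs and accuracies, and the pruning of models that cost more without being more accurate are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ScheduledModel:
    """One entry of a model schedule (hypothetical structure)."""
    name: str        # illustrative label for the model
    cost: float      # total cost of the measures the model uses
    accuracy: float  # expected predictive accuracy, e.g. from validation


def build_schedule(candidates: List[ScheduledModel]) -> List[ScheduledModel]:
    """Sort candidates by cost and keep only those that are strictly more
    accurate than every cheaper (or equally priced) model already kept.
    This pruning rule is an assumption, not taken from the abstract."""
    schedule: List[ScheduledModel] = []
    for m in sorted(candidates, key=lambda m: (m.cost, -m.accuracy)):
        if not schedule or m.accuracy > schedule[-1].accuracy:
            schedule.append(m)
    return schedule


def best_under_budget(schedule: List[ScheduledModel],
                      budget: float) -> Optional[ScheduledModel]:
    """Return the most accurate model whose cost fits the budget, or None.
    Because the pruned schedule increases in both cost and accuracy,
    the last affordable entry is the best one."""
    affordable = [m for m in schedule if m.cost <= budget]
    return affordable[-1] if affordable else None


# Made-up candidate models, for illustration only.
candidates = [
    ScheduledModel("A", cost=1.0, accuracy=0.70),
    ScheduledModel("B", cost=2.0, accuracy=0.68),  # costs more than A, less accurate
    ScheduledModel("C", cost=3.0, accuracy=0.80),
    ScheduledModel("D", cost=5.0, accuracy=0.85),
]
schedule = build_schedule(candidates)
choice = best_under_budget(schedule, budget=4.0)
```

Under these assumptions, sweeping `budget` over a range of values traces out the trade-off between budget and predictive accuracy that the schedule is meant to expose.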