Model Optimization in Imbalanced Regression

Imbalanced domain learning aims to produce models that accurately predict instances that, though underrepresented, are of utmost importance for the domain. Research in this field has focused mainly on classification tasks; by comparison, the number of studies on regression tasks is negligible. One of the main reasons is the lack of loss functions capable of focusing on minimizing the errors at extreme (rare) values. Recently, an evaluation metric was introduced: Squared Error Relevance Area (SERA). This metric places greater emphasis on the errors committed at extreme values while also accounting for performance over the whole target variable domain, thus preventing severe bias. However, its effectiveness as an optimization metric is unknown. In this paper, our goal is to study the impact of using SERA as an optimization criterion in imbalanced regression tasks. Using gradient boosting algorithms as a proof of concept, we perform an experimental study with 36 data sets of different domains and sizes. Results show that models that used SERA as an objective function are practically better at predicting extreme values than the models produced by their respective standard boosting algorithms. This confirms that SERA can be embedded as a loss function into optimization-based learning algorithms for imbalanced regression scenarios.
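To make the idea of SERA as an objective function concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the commonly cited definition of SERA as the area under the curve of SER_t over relevance thresholds t in [0, 1], where SER_t is the sum of squared errors over instances whose relevance is at least t. The abstract itself does not state this definition, and the function names, the relevance values, and the number of integration steps are all hypothetical. Under that assumption, the derivative of SERA with respect to a prediction reduces to the squared-error derivative scaled by the instance's relevance, which is the kind of gradient and Hessian pair that gradient boosting needs from a custom objective.

```python
# Sketch only: the SERA definition and the relevance values used here are
# assumptions for illustration; the abstract does not specify them.

def sera(y_true, y_pred, relevance, steps=1000):
    """Approximate SERA with the trapezoidal rule over thresholds t in [0, 1]."""
    def ser(t):
        # Sum of squared errors over instances with relevance >= t.
        return sum((p - y) ** 2
                   for y, p, r in zip(y_true, y_pred, relevance) if r >= t)

    values = [ser(k / steps) for k in range(steps + 1)]
    h = 1.0 / steps
    return h * (sum(values) - 0.5 * (values[0] + values[-1]))


def sera_grad_hess(y_true, y_pred, relevance):
    """Per-instance first and second derivatives of SERA w.r.t. predictions.

    Integrating the indicator [relevance >= t] over t in [0, 1] yields the
    relevance itself, so each squared-error term is weighted by its relevance.
    """
    grad = [2.0 * r * (p - y) for y, p, r in zip(y_true, y_pred, relevance)]
    hess = [2.0 * r for r in relevance]
    return grad, hess


# Hypothetical data: the third target is an extreme value with full relevance.
y_true = [1.0, 2.0, 10.0]
y_pred = [1.5, 2.0, 7.0]
relevance = [0.0, 0.25, 1.0]

loss = sera(y_true, y_pred, relevance)
grad, hess = sera_grad_hess(y_true, y_pred, relevance)
```

In this toy case the loss is dominated by the error on the extreme value, while the zero-relevance instance contributes almost nothing. Note that zero-relevance instances also get a zero Hessian, so a practical implementation might need a small floor on the Hessian; that is a design consideration of this sketch, not something stated in the abstract.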
