In imbalanced binary classification problems, the objective metric is often non-symmetric and associates a higher penalty with the minority samples. On the other hand, the loss function used for training is usually symmetric, penalizing majority and minority samples equally. Balancing schemes, which augment the data to be more balanced before training the model, were proposed to address this discrepancy and were empirically shown to improve prediction performance on tabular data. However, recent studies of consistent classifiers suggest that the metric discrepancy may not hinder prediction performance. In light of these recent theoretical results, we carefully revisit the empirical study of balancing tabular data. Our extensive experiments, on 73 datasets, show that in general, in accordance with theory, the best predictions are achieved by using a strong consistent classifier, and balancing is not beneficial. We further identify several scenarios in which balancing is effective and observe that prior studies mainly focus on these settings.
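To make the balancing idea concrete, the following is a minimal sketch of one common scheme, random oversampling: the minority class is resampled with replacement until both classes have equal size, before any model is trained. The dataset sizes, class means, and the helper name `oversample_minority` are hypothetical and chosen only for illustration; they do not come from the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy imbalanced dataset (hypothetical): 1-D feature, 90/10 class split.
X_maj = rng.normal(0.0, 1.0, size=900)  # majority class (label 0)
X_min = rng.normal(2.0, 1.0, size=100)  # minority class (label 1)

def oversample_minority(X_maj, X_min, rng):
    """Random oversampling: draw minority samples with replacement
    until the minority class matches the majority class in size."""
    idx = rng.integers(0, len(X_min), size=len(X_maj))
    return X_maj, X_min[idx]

X_maj_bal, X_min_bal = oversample_minority(X_maj, X_min, rng)
print(len(X_maj_bal), len(X_min_bal))  # both classes now have 900 samples
```

The alternative the paper's theoretical results point to is to skip this augmentation step entirely and instead train a strong consistent classifier on the original imbalanced data, adjusting the decision threshold (or using class-dependent weights) to match the non-symmetric metric.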