Improving Evaluation of Debiasing in Image Classification

Image classifiers often rely too heavily on peripheral attributes that are strongly correlated with the target class (i.e., dataset bias) when making predictions. Because of this bias, a model correctly classifies samples that contain the bias attributes (i.e., bias-aligned samples) but fails on samples that lack them (i.e., bias-conflicting samples). Many recent studies have focused on mitigating such dataset bias, a task referred to as debiasing. However, our comprehensive study indicates that several issues need to be addressed when evaluating debiasing in image classification. First, most previous studies do not specify how they select their hyper-parameters and model checkpoints (i.e., the tuning criterion). Second, debiasing studies so far have evaluated their proposed methods on datasets with excessively high bias severity, and these methods show degraded performance on datasets with low bias severity. Third, debiasing studies do not share consistent experimental settings (e.g., datasets and neural networks), which need to be standardized for fair comparisons. To address these issues, this paper 1) proposes an evaluation metric, the 'Align-Conflict (AC) score', as the tuning criterion, 2) includes experimental settings with low bias severity and shows that they remain largely unexplored, and 3) unifies the experimental settings to promote fair comparisons between debiasing methods. We believe that our findings and lessons will inspire future researchers in debiasing to further push state-of-the-art performance with fair comparisons.
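To make the bias-aligned / bias-conflicting distinction concrete, the following is a minimal sketch of how accuracy could be reported separately on the two groups. It assumes (this is not stated in the abstract) that each sample carries a bias-attribute label in the same index space as the class labels, so that a sample counts as bias-aligned when its bias attribute equals its target class. The abstract does not define how the AC score is computed; the `harmonic_mean` helper below is a hypothetical way of combining the two accuracies, shown for illustration only.

```python
from typing import Sequence, Tuple


def aligned_conflicting_accuracy(
    preds: Sequence[int],
    labels: Sequence[int],
    biases: Sequence[int],
) -> Tuple[float, float]:
    """Return (bias-aligned accuracy, bias-conflicting accuracy).

    Assumption: a sample is bias-aligned when its bias attribute index
    equals its class label, and bias-conflicting otherwise.
    """
    aligned_hits = []
    conflicting_hits = []
    for pred, label, bias in zip(preds, labels, biases):
        correct = pred == label
        if bias == label:
            aligned_hits.append(correct)
        else:
            conflicting_hits.append(correct)

    def accuracy(hits):
        # NaN signals that the group is empty in this evaluation set.
        return sum(hits) / len(hits) if hits else float("nan")

    return accuracy(aligned_hits), accuracy(conflicting_hits)


def harmonic_mean(a: float, b: float) -> float:
    """Hypothetical combination of the two accuracies (an assumption,
    not a formula taken from the abstract)."""
    return 0.0 if a + b == 0 else 2 * a * b / (a + b)


# Toy example: the model gets both aligned samples right,
# but only one of the two conflicting samples.
preds = [0, 1, 1, 1]
labels = [0, 1, 0, 1]
biases = [0, 1, 1, 0]
acc_aligned, acc_conflicting = aligned_conflicting_accuracy(preds, labels, biases)
combined = harmonic_mean(acc_aligned, acc_conflicting)
```

Reporting the two accuracies separately shows when a high overall accuracy comes mostly from bias-aligned samples, which is the failure mode the abstract describes.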
