The performance of deep neural networks is strongly influenced by how the training dataset is constructed. In particular, when attributes that are strongly correlated with the target attribute are present, the trained model can make unintended prejudgments and show significant inference errors (i.e., the dataset bias problem). Various methods have been proposed to mitigate dataset bias, and their emphasis is on weakly correlated samples, called bias-conflicting samples. These methods rely on explicit bias labels provided by humans or on empirical correlation metrics (e.g., training loss). However, such metrics incur human labeling costs or lack sufficient theoretical justification. In this study, we propose a debiasing algorithm, called PGD (Per-sample Gradient-based Debiasing), that comprises three steps: (1) training a model with uniform batch sampling, (2) setting the importance of each sample in proportion to the norm of its per-sample gradient, and (3) retraining the model with importance-batch sampling, using the probabilities obtained in step (2). Compared with existing baselines on various synthetic and real-world datasets, the proposed method achieves state-of-the-art accuracy on the classification task. Furthermore, we provide a theoretical understanding of how PGD mitigates dataset bias.
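The three steps above can be sketched on a toy logistic-regression problem. This is a minimal illustrative sketch, not the paper's implementation: the data, model, and hyperparameters are all stand-ins chosen for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 2-D points with binary labels (a stand-in for a biased dataset).
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Step (1): train a model with uniform batch sampling.
w = np.zeros(2)
for _ in range(300):
    idx = rng.choice(len(X), size=32)            # uniform batches
    p = sigmoid(X[idx] @ w)
    w -= 0.1 * X[idx].T @ (p - y[idx]) / len(idx)

# Step (2): set each sample's importance in proportion to its
# per-sample gradient norm under the trained model.
p_all = sigmoid(X @ w)
per_sample_grad = (p_all - y)[:, None] * X       # logistic-loss gradient per sample
grad_norms = np.linalg.norm(per_sample_grad, axis=1)
probs = grad_norms / grad_norms.sum()            # sampling distribution

# Step (3): retrain with importance-batch sampling, drawing batches
# according to the probabilities from step (2).
w2 = np.zeros(2)
for _ in range(300):
    idx = rng.choice(len(X), size=32, p=probs)   # importance batches
    p = sigmoid(X[idx] @ w2)
    w2 -= 0.1 * X[idx].T @ (p - y[idx]) / len(idx)
```

The key design point is that bias-conflicting samples, which the step-(1) model fits poorly, tend to have larger gradient norms and therefore get sampled more often in step (3).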