The performance of a computer vision model depends on the size and quality of its training data. Recent studies have unveiled previously unknown composition biases in common image datasets that lead to skewed model outputs, and have proposed methods to mitigate these biases. However, most existing works assume that human-generated annotations can be considered gold-standard and unbiased. In this paper, we reveal that this assumption can be problematic, and that special care should be taken to prevent models from learning such annotation biases. We focus on facial expression recognition and compare the label biases between lab-controlled and in-the-wild datasets. We demonstrate that many expression datasets contain significant annotation biases between genders, especially for the happy and angry expressions, and that traditional methods cannot fully mitigate such biases in trained models. To remove expression annotation bias, we propose an AU-Calibrated Facial Expression Recognition (AUC-FER) framework that utilizes facial action units (AUs) and incorporates the triplet loss into the objective function. Experimental results suggest that the proposed method is more effective in removing expression annotation bias than existing techniques.
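To make the stated objective concrete, below is a minimal sketch of how a triplet loss term guided by facial action units could be combined with a standard expression classifier. It assumes PyTorch, an encoder that yields expression embeddings, and per-image AU activation vectors from an external detector; the function `au_triplet_loss`, the batch-level triplet-mining rule, and the margin value are illustrative assumptions, not the exact AUC-FER formulation.

```python
# Illustrative sketch only; names and the AU-based mining rule are assumptions,
# not the authors' released code.
import torch
import torch.nn.functional as F

def au_triplet_loss(features, au_vectors, margin=0.2):
    """Triplet loss term calibrated by AU similarity.

    features:   (N, D) expression embeddings from the encoder.
    au_vectors: (N, K) AU activation vectors, e.g. from an off-the-shelf AU detector.
    For each anchor, the batch sample with the most similar AU vector is used as
    the positive and the least similar one as the negative, so feature distances
    follow AU evidence rather than potentially biased expression labels.
    """
    au_dist = torch.cdist(au_vectors, au_vectors)          # (N, N) pairwise AU distances
    self_mask = torch.eye(au_vectors.size(0), dtype=torch.bool,
                          device=au_vectors.device)
    # Exclude each sample from being its own positive/negative.
    pos_idx = au_dist.masked_fill(self_mask, float("inf")).argmin(dim=1)   # most AU-similar
    neg_idx = au_dist.masked_fill(self_mask, float("-inf")).argmax(dim=1)  # least AU-similar
    return F.triplet_margin_loss(features, features[pos_idx], features[neg_idx],
                                 margin=margin)
```

In training, this term would be weighted and added to the usual classification objective, e.g. `loss = F.cross_entropy(logits, labels) + lam * au_triplet_loss(features, au_vectors)`, where `lam` is a hypothetical trade-off weight.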