Facial Expression Recognition (FER) in the wild is an extremely challenging task in computer vision due to varied backgrounds, low-quality facial images, and the subjectivity of annotators. These uncertainties make it difficult for neural networks to learn robust features on limited-scale datasets. Moreover, the networks can easily be disturbed by the above factors and make incorrect decisions. Recently, the vision transformer (ViT) and data-efficient image transformers (DeiT) have demonstrated strong performance on conventional classification tasks. The self-attention mechanism gives transformers a global receptive field from the first layer, which dramatically enhances their feature extraction capability. In this work, we propose a novel pure transformer-based Mask Vision Transformer (MViT) for FER in the wild, which consists of two modules: a transformer-based mask generation network (MGN) that generates a mask to filter out complex backgrounds and occlusions in face images, and a dynamic relabeling module that rectifies incorrect labels in in-the-wild FER datasets. Extensive experimental results demonstrate that our MViT outperforms state-of-the-art methods on RAF-DB with 88.62%, FERPlus with 89.22%, and AffectNet-7 with 64.57%, and achieves a comparable result on AffectNet-8 with 61.40%.
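To make the two-module design described above more concrete, the following is a minimal sketch of how a transformer-based mask generation network and a dynamic relabeling rule could be wired together. The class names (`MaskGenerationNetwork`, `MaskViT`), the soft patch-level mask formulation, and the confidence-margin relabeling rule are illustrative assumptions, not the authors' released implementation.

```python
# Minimal, illustrative sketch of the two-module MViT design (assumptions, not the paper's code).
import torch
import torch.nn as nn


class MaskGenerationNetwork(nn.Module):
    """Hypothetical transformer-based MGN: predicts a soft mask over patch tokens
    so that background/occlusion tokens are suppressed before classification."""

    def __init__(self, dim: int, depth: int = 2, heads: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.to_mask = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, dim) patch embeddings -> mask: (B, N, 1) with values in [0, 1]
        return self.to_mask(self.encoder(tokens))


class MaskViT(nn.Module):
    """Hypothetical MViT wrapper: backbone patch features are re-weighted by the
    MGN mask, then pooled and classified into expression categories."""

    def __init__(self, backbone: nn.Module, dim: int, num_classes: int = 7):
        super().__init__()
        self.backbone = backbone  # any ViT/DeiT-style encoder returning (B, N, dim) patch tokens
        self.mgn = MaskGenerationNetwork(dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        tokens = self.backbone(images)        # (B, N, dim) patch tokens
        masked = tokens * self.mgn(tokens)    # filter background/occlusion tokens
        return self.head(masked.mean(dim=1))  # (B, num_classes) logits


@torch.no_grad()
def dynamic_relabel(logits: torch.Tensor, labels: torch.Tensor,
                    margin: float = 0.2) -> torch.Tensor:
    """One plausible relabeling rule (an assumption): replace a given label when
    the model is substantially more confident in another class."""
    probs = logits.softmax(dim=-1)
    top_prob, top_idx = probs.max(dim=-1)
    given_prob = probs.gather(1, labels.unsqueeze(1)).squeeze(1)
    return torch.where(top_prob - given_prob > margin, top_idx, labels)
```

In this sketch the mask is applied multiplicatively to patch tokens before pooling, and relabeling compares the model's top predicted probability against the probability of the annotated class; the margin threshold and the exact masking point within the network are design choices not specified by the abstract.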