Humans usually convey emotions, voluntarily or involuntarily, through facial expressions. Automatically recognizing basic expressions (such as happiness, sadness, and neutral) from a facial image, i.e., facial expression recognition (FER), is extremely challenging and has attracted much research interest. Large-scale datasets and powerful inference models have been proposed to address the problem. Although considerable progress has been made, most state-of-the-art methods, which employ convolutional neural networks (CNNs) or elaborately modified Vision Transformers (ViTs), depend heavily on upstream supervised pretraining. Transformers are replacing CNNs as the dominant architecture in more and more computer vision tasks, but they usually require much more training data, since they encode fewer inductive biases than CNNs. To explore whether a vanilla ViT without extra training samples from upstream tasks can achieve competitive accuracy, we use a plain ViT with MAE pretraining to perform the FER task. Specifically, we first pretrain the original ViT as a Masked Autoencoder (MAE) on a large facial expression dataset without expression labels. Then, we fine-tune the ViT on popular facial expression datasets with expression labels. The presented method achieves competitive results, 90.22\% on RAF-DB and 61.73\% on AffectNet, and can serve as a simple yet strong ViT-based baseline for FER studies.
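Below is a minimal PyTorch sketch of the two-stage recipe summarized above: MAE-style self-supervised pretraining of a plain ViT encoder on unlabeled face images, followed by supervised fine-tuning with a linear classification head on expression labels. The architecture sizes, mask ratio, mean-pooling head, and lightweight decoder are illustrative assumptions, not the paper's exact configuration.

```python
# Stage 1: MAE-style pretraining of a plain ViT encoder (no expression labels).
# Stage 2: supervised fine-tuning of the same encoder with a linear head.
# All hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn


def patchify(imgs, patch=16):
    """Split images (B, 3, H, W) into flattened patches (B, N, patch*patch*3)."""
    B, C, H, W = imgs.shape
    x = imgs.reshape(B, C, H // patch, patch, W // patch, patch)
    return x.permute(0, 2, 4, 3, 5, 1).reshape(B, (H // patch) * (W // patch), -1)


class PatchEmbed(nn.Module):
    def __init__(self, img_size=224, patch=16, dim=768):
        super().__init__()
        self.num_patches = (img_size // patch) ** 2
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

    def forward(self, x):  # (B, 3, H, W) -> (B, N, dim)
        return self.proj(x).flatten(2).transpose(1, 2)


class MAEPretrainer(nn.Module):
    """Stage 1: mask most patches, encode only the visible ones with a plain
    ViT encoder, and reconstruct the pixels of the masked patches."""

    def __init__(self, dim=768, depth=12, heads=12, dec_dim=512, dec_depth=4,
                 mask_ratio=0.75, patch=16):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.patch_embed = PatchEmbed(patch=patch, dim=dim)
        n = self.patch_embed.num_patches
        self.pos = nn.Parameter(torch.zeros(1, n, dim))
        enc_layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, depth)
        # Lightweight decoder, used only during pretraining and discarded later.
        self.enc_to_dec = nn.Linear(dim, dec_dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dec_dim))
        self.dec_pos = nn.Parameter(torch.zeros(1, n, dec_dim))
        dec_layer = nn.TransformerEncoderLayer(dec_dim, 8, dec_dim * 4, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec_layer, dec_depth)
        self.recon_head = nn.Linear(dec_dim, patch * patch * 3)

    def forward(self, imgs):
        tokens = self.patch_embed(imgs) + self.pos
        B, N, D = tokens.shape
        keep = int(N * (1 - self.mask_ratio))
        idx = torch.rand(B, N, device=imgs.device).argsort(dim=1)  # random patch order
        vis_idx, mask_idx = idx[:, :keep], idx[:, keep:]
        visible = torch.gather(tokens, 1, vis_idx.unsqueeze(-1).expand(-1, -1, D))
        latent = self.encoder(visible)               # encode visible patches only
        # Decoder sees encoded visible tokens plus mask tokens, with position info.
        dec_dim = self.mask_token.shape[-1]
        dec_in = torch.cat([self.enc_to_dec(latent),
                            self.mask_token.expand(B, N - keep, -1)], dim=1)
        dec_pos = torch.gather(self.dec_pos.expand(B, -1, -1), 1,
                               idx.unsqueeze(-1).expand(-1, -1, dec_dim))
        pred = self.recon_head(self.decoder(dec_in + dec_pos))[:, keep:]
        target = torch.gather(patchify(imgs), 1,
                              mask_idx.unsqueeze(-1).expand(-1, -1, pred.shape[-1]))
        return nn.functional.mse_loss(pred, target)  # loss on masked patches only


class FERClassifier(nn.Module):
    """Stage 2: reuse the pretrained encoder; classify mean-pooled tokens."""

    def __init__(self, pretrained: MAEPretrainer, num_classes=7):
        super().__init__()
        self.patch_embed = pretrained.patch_embed
        self.pos = pretrained.pos
        self.encoder = pretrained.encoder
        self.head = nn.Linear(self.pos.shape[-1], num_classes)

    def forward(self, imgs):
        tokens = self.encoder(self.patch_embed(imgs) + self.pos)
        return self.head(tokens.mean(dim=1))
```

In the actual pipeline, the pretraining loss would be minimized over the unlabeled facial images, after which the encoder weights are carried into the classifier and fine-tuned with cross-entropy on RAF-DB or AffectNet; the mean-pooled classification head here is one common simplification of a [CLS]-token head.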