Facial Action Unit (FAU) detection is a fine-grained classification problem that involves identifying the activation of individual action units on the human face, as defined by the Facial Action Coding System. In this paper, we present a simple yet efficient Vision Transformer-based approach to the Action Unit (AU) detection task of the Affective Behavior Analysis in-the-wild (ABAW) competition. We employ the Video Vision Transformer (ViViT) network to capture temporal facial changes in video. In addition, to reduce the massive size of the Vision Transformer model, we replace the ViViT feature extraction layers with a CNN backbone (RegNet). Our model outperforms the baseline model of the ABAW 2023 challenge by a notable margin of 14%. Furthermore, the achieved results are comparable to those of the top three teams in the previous ABAW 2022 challenge.
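The architecture summarized above can be sketched in PyTorch as follows. This is a minimal illustration under stated assumptions, not the authors' released code: per-frame features from a torchvision RegNet stand in for ViViT's patch-embedding stage, and a temporal Transformer encoder models facial changes across frames before a multi-label AU head. The class name RegNetViViT, the choice of regnet_y_400mf, and all hyperparameters (d_model=256, 4 layers, 12 AUs as in the ABAW AU label set) are illustrative assumptions.

```python
# Sketch (assumed, not the authors' code): RegNet frame features + temporal
# Transformer encoder + multi-label AU head, as the abstract describes.
import torch
import torch.nn as nn
from torchvision.models import regnet_y_400mf


class RegNetViViT(nn.Module):
    def __init__(self, num_aus: int = 12, d_model: int = 256,
                 num_layers: int = 4, num_heads: int = 8, max_frames: int = 32):
        super().__init__()
        backbone = regnet_y_400mf(weights=None)   # CNN feature extractor
        feat_dim = backbone.fc.in_features
        backbone.fc = nn.Identity()               # drop the classification head
        self.backbone = backbone
        self.proj = nn.Linear(feat_dim, d_model)  # map CNN features to token size
        self.pos_emb = nn.Parameter(torch.zeros(1, max_frames, d_model))
        layer = nn.TransformerEncoderLayer(d_model, num_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(d_model, num_aus)   # one logit per action unit

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (batch, frames, 3, H, W) -> per-frame AU logits
        b, t, c, h, w = clip.shape
        feats = self.backbone(clip.reshape(b * t, c, h, w))  # (b*t, feat_dim)
        tokens = self.proj(feats).reshape(b, t, -1) + self.pos_emb[:, :t]
        tokens = self.temporal(tokens)            # temporal attention over frames
        return self.head(tokens)                  # (batch, frames, num_aus)


# Example: a 2-clip batch of 16 frames at 112x112
logits = RegNetViViT()(torch.randn(2, 16, 3, 112, 112))
print(logits.shape)  # torch.Size([2, 16, 12])
```

Swapping the Transformer's learned patch embedding for a compact CNN backbone in this way is what keeps the parameter count low while retaining the temporal self-attention that ViViT provides.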