Facial expression recognition (FER) is an essential task for understanding human behaviors. As one of the most informative human behaviors, facial expressions are often compound and variable: different people may convey the same expression in markedly different ways. However, most FER methods still use one-hot or soft labels as supervision, which lack sufficient semantic descriptions of facial expressions and are less interpretable. Recently, contrastive vision-language pre-training (VLP) models (e.g., CLIP), which use text as supervision, have injected new vitality into various computer vision tasks, benefiting from the rich semantics in text. Therefore, in this work, we propose CLIPER, a unified framework for both static and dynamic facial Expression Recognition based on CLIP. In addition, we introduce multiple expression text descriptors (METD) to learn fine-grained expression representations that make CLIPER more interpretable. We conduct extensive experiments on several popular FER benchmarks and achieve state-of-the-art performance, demonstrating the effectiveness of CLIPER.
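To make the text-as-supervision idea concrete, the following is a minimal, illustrative sketch (not the authors' implementation) of CLIP-style expression classification with several text descriptors per class. The hand-written descriptor strings and the file name face.jpg are placeholder assumptions; in CLIPER the fine-grained descriptors (METD) are learned rather than fixed prompts.

```python
# Minimal sketch: score a face image against multiple text descriptors per
# expression class using OpenAI CLIP (https://github.com/openai/CLIP).
# Descriptor strings and "face.jpg" are hypothetical placeholders.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Several descriptors per class, standing in for the learnable METD.
descriptors = {
    "happiness": ["a face with a broad smile", "raised cheeks and crinkled eyes"],
    "sadness":   ["a face with downturned lips", "drooping eyelids and a frown"],
    "surprise":  ["a face with raised eyebrows", "wide-open eyes and mouth"],
}

image = preprocess(Image.open("face.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    img_feat = model.encode_image(image)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    scores = {}
    for label, texts in descriptors.items():
        tokens = clip.tokenize(texts).to(device)
        txt_feat = model.encode_text(tokens)
        txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
        # Average cosine similarity over the class's descriptors.
        scores[label] = (img_feat @ txt_feat.T).mean().item()

print(max(scores, key=scores.get))
```

Averaging over several descriptors per class, rather than matching a single prompt, is what gives the fine-grained, more interpretable class representation the abstract alludes to.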