The COVID-19 pandemic has changed social norms and affected every aspect of our lives, social life in particular. It has led people to wear medical face masks extensively in order to prevent transmission. This face occlusion can strongly impair the reading of emotion from the face, which urges us to incorporate the whole body into emotion recognition: the body now needs to play a larger role, despite its complementary nature. In this paper, we study the effect of face occlusion on emotion recognition performance and show that full-body input outperforms the masked face alone. We use a deep learning model based on the Temporal Segment Network framework and aim to fully overcome the consequences of the face mask. Although a single RGB stream model can adapt and learn both facial and bodily features, doing so may let irrelevant information confuse the model. By processing those features separately and combining their preliminary prediction scores with a late fusion scheme, we take advantage of both modalities more effectively. This architecture also naturally supports temporal modeling by mixing information among neighboring segment frames. Experimental results suggest that spatial structure plays the more important role in emotional expression, while temporal structure is complementary.
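To make the late fusion idea concrete, the following is a minimal sketch rather than the paper's implementation. It assumes each stream (face and body) produces class logits for every sampled segment, that segment scores are combined by simple averaging, and that the two streams are fused by a weighted average of their softmax probabilities. The function names, the `face_weight` parameter, and all numeric values are hypothetical and chosen only for illustration.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def segment_consensus(segment_logits):
    """Average per-segment class logits into one video-level score vector.

    Averaging is an assumed consensus function; the abstract does not
    specify which one is used.
    """
    n = len(segment_logits)
    return [sum(col) / n for col in zip(*segment_logits)]


def late_fusion(face_logits, body_logits, face_weight=0.5):
    """Fuse the two streams by a weighted average of their probabilities.

    face_weight is a hypothetical hyperparameter, not a value from the paper.
    """
    face_p = softmax(face_logits)
    body_p = softmax(body_logits)
    return [face_weight * f + (1.0 - face_weight) * b
            for f, b in zip(face_p, body_p)]


# Made-up logits: 3 segments x 3 emotion classes per stream.
# The masked-face stream is nearly uninformative; the body stream is confident.
face_segments = [[0.2, 0.1, 0.0], [0.1, 0.3, 0.0], [0.0, 0.2, 0.1]]
body_segments = [[2.0, 0.5, 0.1], [1.5, 0.2, 0.3], [1.8, 0.4, 0.2]]

fused = late_fusion(
    segment_consensus(face_segments),
    segment_consensus(body_segments),
    face_weight=0.3,  # assumed: trust the occluded face less than the body
)
predicted = max(range(len(fused)), key=fused.__getitem__)
print(fused, predicted)
```

Because fusion happens on prediction scores rather than on intermediate features, each stream can be trained and evaluated on its own, and the weight given to the occluded face can be adjusted without retraining either stream.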