In this work we tackle the task of video-based audio-visual emotion recognition, within the scope of the 2nd Workshop and Competition on Affective Behavior Analysis in-the-wild (ABAW2). Poor illumination, head/body orientation, and low image resolution can all hinder the performance of methodologies that rely solely on the extraction and analysis of facial features. To alleviate this problem, we leverage both bodily and contextual features as part of a broader emotion recognition framework. We adopt a standard CNN-RNN cascade as the backbone of our proposed model for sequence-to-sequence (seq2seq) learning. Apart from learning through the RGB input modality, we construct an aural stream that operates on sequences of extracted mel-spectrograms. Our extensive experiments on the challenging and newly assembled Aff-Wild2 dataset verify the validity of our intuitive multi-stream, multi-modal approach to emotion recognition in-the-wild. Emphasis is laid on the beneficial influence of the human body and scene context, two aspects of the emotion recognition process that have so far remained relatively unexplored. All code was implemented in PyTorch and is publicly available.
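As a concrete illustration of the aural stream's input, the sketch below extracts a mel-spectrogram from a raw waveform with torchaudio. The sample rate, FFT size, hop length, number of mel bins, and the file name `clip.wav` are illustrative assumptions, not the values used in the paper.

```python
# Hedged sketch of the aural preprocessing step: waveform -> mel-spectrogram.
# All parameter values here are assumptions for illustration.
import torchaudio

SAMPLE_RATE = 16_000  # assumed target sample rate

mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE,
    n_fft=1024,       # assumed FFT window size
    hop_length=512,   # assumed hop between frames
    n_mels=64,        # assumed number of mel bins
)
to_db = torchaudio.transforms.AmplitudeToDB()

waveform, sr = torchaudio.load("clip.wav")          # (channels, samples)
if sr != SAMPLE_RATE:
    waveform = torchaudio.functional.resample(waveform, sr, SAMPLE_RATE)

# Mix down to mono, then compute a log-scaled mel-spectrogram.
mel = to_db(mel_transform(waveform.mean(0)))        # (n_mels, time_frames)
# The spectrogram would then be chunked into excerpts aligned with the
# video frames, so each visual frame has a matching aural input.
```

The sketch below then outlines one plausible realization of the multi-stream CNN-RNN cascade described above, with face, body, context, and aural streams fused by concatenation before a recurrent seq2seq layer. The ResNet-18 backbone, feature dimensions, choice of GRU, and late-fusion strategy are assumptions for illustration, not the authors' exact architecture.

```python
# Minimal sketch (not the authors' exact implementation) of a multi-stream
# CNN-RNN cascade: per-frame CNN features from face, body, context, and
# mel-spectrogram inputs, concatenated and fed to a GRU for seq2seq output.
import torch
import torch.nn as nn
from torchvision import models


class StreamCNN(nn.Module):
    """Per-frame feature extractor; same structure reused for every stream."""

    def __init__(self, out_dim=256):
        super().__init__()
        backbone = models.resnet18(weights=None)  # any CNN backbone works
        backbone.fc = nn.Linear(backbone.fc.in_features, out_dim)
        self.backbone = backbone

    def forward(self, x):                          # x: (B, T, 3, H, W)
        b, t = x.shape[:2]
        feats = self.backbone(x.flatten(0, 1))     # (B*T, out_dim)
        return feats.view(b, t, -1)                # (B, T, out_dim)


class MultiStreamCNNRNN(nn.Module):
    def __init__(self, n_outputs=7, feat_dim=256, hidden=256):
        super().__init__()
        self.face, self.body, self.context = (StreamCNN(feat_dim) for _ in range(3))
        self.aural = StreamCNN(feat_dim)  # spectrogram excerpts tiled to 3 channels
        self.rnn = nn.GRU(4 * feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_outputs)   # per-frame predictions

    def forward(self, face, body, context, mel):
        # Late fusion by concatenation of the four per-frame feature streams.
        fused = torch.cat(
            [self.face(face), self.body(body),
             self.context(context), self.aural(mel)], dim=-1)
        seq, _ = self.rnn(fused)                   # seq2seq over time
        return self.head(seq)                      # (B, T, n_outputs)
```

A per-frame loss, e.g. cross-entropy for categorical expression labels or a CCC-based loss for continuous valence/arousal, would then be applied over the (B, T, n_outputs) output sequence.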