In this work we tackle the task of video-based visual emotion recognition in the wild. Standard methodologies that rely solely on the extraction of bodily and facial features often fall short of accurately predicting emotion when these sources of affective information are inaccessible due to head/body orientation, low resolution, or poor illumination. We aim to alleviate this problem by leveraging visual context, in the form of scene characteristics and attributes, as part of a broader emotion recognition framework. Temporal Segment Networks (TSN) constitute the backbone of our proposed model. Apart from the RGB input modality, we make use of dense Optical Flow, following an intuitive multi-stream approach for a more effective encoding of motion. Furthermore, we shift our attention towards skeleton-based learning and leverage action-centric data as a means of pre-training a Spatial-Temporal Graph Convolutional Network (ST-GCN) for the task of emotion recognition. Our extensive experiments on the challenging Body Language Dataset (BoLD) verify the superiority of our methods over existing approaches, while by properly combining all of the aforementioned modules in a network ensemble, we surpass the previous best published recognition scores by a large margin.
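To make the multi-stream design concrete, the following is a minimal sketch, not the authors' implementation: a TSN-style segmental consensus over per-snippet RGB and Optical Flow scores, followed by a simple late-fusion ensemble with a skeleton-branch score. All names (`rgb_cnn`, `flow_cnn`, `NUM_SEGMENTS`, the 26-class output assumed to match BoLD's categorical emotions) are illustrative assumptions.

```python
# Hedged sketch of a TSN-style two-stream model with late fusion (PyTorch).
# The backbones are placeholders; only the sampling/consensus logic is shown.
import torch
import torch.nn as nn

NUM_SEGMENTS = 3   # TSN samples one snippet per temporal segment (assumption)
NUM_CLASSES = 26   # assumed: BoLD's categorical emotion labels


class TwoStreamTSN(nn.Module):
    def __init__(self, rgb_cnn: nn.Module, flow_cnn: nn.Module):
        super().__init__()
        self.rgb_cnn = rgb_cnn    # per-snippet RGB score extractor
        self.flow_cnn = flow_cnn  # per-snippet stacked-flow score extractor

    def forward(self, rgb: torch.Tensor, flow: torch.Tensor):
        # rgb:  (B, NUM_SEGMENTS, 3,   H, W) one RGB frame per segment
        # flow: (B, NUM_SEGMENTS, 2*L, H, W) L stacked flow fields per segment
        b, s = rgb.shape[:2]
        rgb_scores = self.rgb_cnn(rgb.flatten(0, 1)).view(b, s, -1)
        flow_scores = self.flow_cnn(flow.flatten(0, 1)).view(b, s, -1)
        # Segmental consensus: average the snippet scores over all segments.
        return rgb_scores.mean(dim=1), flow_scores.mean(dim=1)


def ensemble(rgb_logits, flow_logits, skeleton_logits, w=(1.0, 1.0, 1.0)):
    """Late fusion of the three modality scores; weights are a design choice."""
    streams = (rgb_logits, flow_logits, skeleton_logits)
    probs = [torch.softmax(logits, dim=-1) for logits in streams]
    return sum(wi * p for wi, p in zip(w, probs)) / sum(w)
```

Averaging softmax scores is one common fusion strategy; the paper's actual ensemble weighting may differ.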
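The skeleton-branch transfer step can likewise be sketched: pre-train an ST-GCN on an action-recognition corpus, then replace its classification head for the emotion task. The layer name `fcn` follows the public ST-GCN reference implementation, but the checkpoint path and key filtering below are assumptions, not the paper's code.

```python
# Hedged sketch: adapting an action-pre-trained ST-GCN to emotion recognition.
import torch
import torch.nn as nn


def adapt_pretrained_stgcn(model: nn.Module, ckpt_path: str,
                           feat_dim: int, num_emotions: int) -> nn.Module:
    """Load action-recognition weights, keep the spatio-temporal
    graph-convolution backbone, and attach a fresh emotion head."""
    state = torch.load(ckpt_path, map_location="cpu")
    # Drop the old action-classifier weights (key prefix is hypothetical).
    state = {k: v for k, v in state.items() if not k.startswith("fcn")}
    model.load_state_dict(state, strict=False)
    # New 1x1-conv classification head over the pooled graph features.
    model.fcn = nn.Conv2d(feat_dim, num_emotions, kernel_size=1)
    return model
```

The head swap with `strict=False` loading is a standard transfer-learning recipe; the paper may freeze or fine-tune different subsets of the backbone.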