In video-based emotion recognition (ER), it is important to effectively leverage the complementary relationship between the audio (A) and visual (V) modalities while retaining the intra-modal characteristics of each modality. In this paper, a recursive joint attention model is proposed, along with long short-term memory (LSTM) modules, for the fusion of vocal and facial expressions in regression-based ER. Specifically, we investigate the possibility of exploiting the complementary nature of the A and V modalities by applying a joint cross-attention model in a recursive fashion, with LSTMs capturing the temporal dependencies both within each modality and across the A-V feature representations. By integrating LSTMs with recursive joint cross-attention, the proposed model can efficiently leverage both intra- and inter-modal relationships for the fusion of the A and V modalities. The results of extensive experiments on the challenging Affwild2 and Fatigue (private) datasets indicate that the proposed A-V fusion model significantly outperforms state-of-the-art methods.
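To make the fusion mechanism concrete, the following is a minimal numpy sketch of joint cross-attention applied recursively, under several assumptions not fixed by the abstract: the joint representation is formed by concatenating the A and V features along the time axis, the attention maps use learned weight matrices (`Wa`, `Wv`, hypothetical names), and a residual connection feeds the attended features back as the next iteration's input. The LSTM layers that, per the paper, model intra-modal temporal dependencies between iterations are omitted here for brevity and noted as a comment.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def joint_cross_attention(Xa, Xv, Wa, Wv):
    """One joint cross-attention pass: each modality attends to the
    joint A-V representation (here: concatenation along the time axis)."""
    J = np.concatenate([Xa, Xv], axis=0)               # (2T, d) joint features
    d = Xa.shape[1]
    Ca = softmax(Xa @ Wa @ J.T / np.sqrt(d), axis=-1)  # (T, 2T) audio -> joint
    Cv = softmax(Xv @ Wv @ J.T / np.sqrt(d), axis=-1)  # (T, 2T) visual -> joint
    Aa = Ca @ J                                        # attended audio features
    Av = Cv @ J                                        # attended visual features
    return Xa + Aa, Xv + Av                            # residual connections

T, d, n_iters = 8, 16, 2                               # toy sizes
Xa = rng.standard_normal((T, d))                       # audio feature sequence
Xv = rng.standard_normal((T, d))                       # visual feature sequence
Wa = rng.standard_normal((d, d)) * 0.1
Wv = rng.standard_normal((d, d)) * 0.1

# Recursive fusion: the attended features become the next iteration's input.
# (In the proposed model, LSTM layers between iterations capture the
# intra-modal temporal dependencies; they are omitted in this sketch.)
for _ in range(n_iters):
    Xa, Xv = joint_cross_attention(Xa, Xv, Wa, Wv)

fused = np.concatenate([Xa, Xv], axis=-1)              # (T, 2d) fused A-V features
print(fused.shape)
```

The recursion depth `n_iters` and the single shared weight matrix per modality are illustrative choices; a trained model would learn these weights and select the number of recursion steps by validation.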