Emotion Recognition in Conversations (ERC) is crucial for developing sympathetic human-machine interaction. In conversational videos, emotion can be present in multiple modalities, i.e., audio, video, and transcript. However, due to the inherent characteristics of these modalities, multi-modal ERC has always been considered a challenging undertaking. Existing ERC research focuses mainly on the textual information in a conversation, ignoring the other two modalities. We anticipate that emotion recognition accuracy can be improved by employing a multi-modal approach. Thus, in this study, we propose a Multi-modal Fusion Network (M2FNet) that extracts emotion-relevant features from the visual, audio, and text modalities. It employs a multi-head attention-based fusion mechanism to combine emotion-rich latent representations of the input data. We introduce a new feature extractor to obtain latent features from the audio and visual modalities; it is trained with a novel adaptive margin-based triplet loss function to learn emotion-relevant features from the audio and visual data. In the domain of ERC, existing methods tend to perform well on one benchmark dataset but not on others. Our results show that the proposed M2FNet architecture outperforms all other methods in terms of weighted average F1 score on the well-known MELD and IEMOCAP datasets and sets a new state of the art in ERC.
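To make the two key components mentioned above more concrete, the sketch below shows one plausible form of an adaptive margin triplet loss and a multi-head attention fusion step in PyTorch. The margin rule, layer sizes, and all function and class names here are illustrative assumptions for exposition, not the exact formulation used in M2FNet.

```python
# Illustrative PyTorch sketch (assumed shapes and margin rule, not the paper's exact code).
import torch
import torch.nn as nn
import torch.nn.functional as F


def adaptive_margin_triplet_loss(anchor, positive, negative, base_margin=0.2, scale=1.0):
    """Triplet loss whose margin adapts to the anchor-positive / anchor-negative gap.

    The adaptive rule (margin = base + scale * sigmoid(d_ap - d_an)) is an assumed
    example; M2FNet defines its own adaptive margin.
    """
    d_ap = F.pairwise_distance(anchor, positive)   # anchor-positive distance
    d_an = F.pairwise_distance(anchor, negative)   # anchor-negative distance
    margin = base_margin + scale * torch.sigmoid(d_ap - d_an)
    return F.relu(d_ap - d_an + margin).mean()


class AttentionFusion(nn.Module):
    """Multi-head attention fusion: text features attend over audio/visual features."""

    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, text_feats, av_feats):
        # text_feats: (B, T, dim), av_feats: (B, S, dim)
        fused, _ = self.attn(query=text_feats, key=av_feats, value=av_feats)
        return fused + text_feats                   # residual connection


# Usage with random tensors standing in for utterance-level embeddings.
anchor, pos, neg = (torch.randn(8, 256) for _ in range(3))
print(adaptive_margin_triplet_loss(anchor, pos, neg).item())
fusion = AttentionFusion()
print(fusion(torch.randn(8, 10, 256), torch.randn(8, 12, 256)).shape)  # (8, 10, 256)
```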