The task of emotion recognition in conversations (ERC) benefits from the availability of multiple modalities, as offered, for example, in the video-based MELD dataset. However, only a few research approaches use both the acoustic and visual information from the MELD videos. There are two reasons for this: First, label-to-video alignments in MELD are noisy, making those videos an unreliable source of emotional speech data. Second, conversations can involve several people in the same scene, which requires detecting the person who speaks the utterance. In this paper, we demonstrate that by using recent automatic speech recognition and active speaker detection models, we are able to realign the videos of MELD and capture the facial expressions of the uttering speakers in 96.92% of the utterances provided in MELD. Experiments with a self-supervised voice recognition model indicate that the realigned MELD videos more closely match the corresponding utterances offered in the dataset. Finally, we devise a model for emotion recognition in conversations trained on the face and audio information of the realigned MELD videos, which outperforms state-of-the-art models for ERC based on vision alone. This indicates that active speaker detection is indeed effective for extracting facial expressions from the uttering speakers, and that faces provide more informative visual cues than the visual features that state-of-the-art models have used so far.
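To make the realignment step concrete, the sketch below illustrates one possible way to match a MELD utterance transcript to a segment of its source video using word-level ASR output, and then hand the resulting clip to an active speaker detector. This is a minimal illustration under stated assumptions, not the paper's exact pipeline: it assumes the openai-whisper package for ASR, and `extract_speaker_faces` is a hypothetical placeholder for an active speaker detection model such as TalkNet.

```python
# Illustrative sketch (assumptions noted above): realign an utterance to its
# source video via ASR segment timestamps, then crop the active speaker's face.
import difflib
import whisper  # openai-whisper


def realign_utterance(video_path: str, utterance_text: str):
    """Return the (start, end) time in seconds of the ASR segment that best
    matches the annotated utterance text."""
    model = whisper.load_model("base")
    result = model.transcribe(video_path)

    best_span, best_score = None, 0.0
    for seg in result["segments"]:
        # String similarity between the ASR hypothesis and the MELD transcript.
        score = difflib.SequenceMatcher(
            None, seg["text"].lower(), utterance_text.lower()
        ).ratio()
        if score > best_score:
            best_span, best_score = (seg["start"], seg["end"]), score
    return best_span


def extract_speaker_faces(video_path: str, start: float, end: float):
    """Hypothetical stand-in: run an active speaker detector (e.g. TalkNet)
    on the realigned clip and return face crops of the person speaking."""
    raise NotImplementedError("plug in an active speaker detection model here")
```

In this sketch, the realigned time span and the detected speaker's face crops would then serve as the visual input to the ERC model, alongside the corresponding audio.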