The task of emotion recognition in conversations (ERC) benefits from the availability of multiple modalities, as provided, for example, in the video-based Multimodal EmotionLines Dataset (MELD). However, only a few research approaches use both acoustic and visual information from the MELD videos. There are two reasons for this: First, label-to-video alignments in MELD are noisy, making those videos an unreliable source of emotional speech data. Second, conversations can involve several people in the same scene, which requires the localisation of the utterance source. In this paper, we introduce MELD with Fixed Audiovisual Information via Realignment (MELD-FAIR). By using recent active speaker detection and automatic speech recognition models, we are able to realign the videos of MELD and capture the facial expressions of the speakers in 96.92% of the utterances provided in MELD. Experiments with a self-supervised voice recognition model indicate that the realigned MELD-FAIR videos match the transcribed utterances given in the MELD dataset more closely. Finally, we devise a model for emotion recognition in conversations trained on the realigned MELD-FAIR videos, which outperforms state-of-the-art models for ERC based on vision alone. This indicates that localising the source of speaking activity is indeed effective for extracting facial expressions of the uttering speakers, and that faces provide more informative visual cues than the visual features that state-of-the-art models have used so far. The MELD-FAIR realignment data, as well as the code of the realignment procedure and of the emotion recognition model, are available at https://github.com/knowledgetechnologyuhh/MELD-FAIR.
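To illustrate the realignment idea described above, the following is a minimal sketch (not the authors' implementation) of how ASR output could be compared against the MELD transcript of an utterance to select the best-matching audio window; `transcribe` is a hypothetical stand-in for any automatic speech recognition model, and the windowing strategy is assumed.

```python
# Hedged sketch: realigning an utterance by matching ASR hypotheses to the
# MELD transcript. Only the Python standard library is used; `transcribe`
# is a placeholder for an ASR model call and must be supplied by the user.
import difflib


def best_matching_window(candidate_windows, target_transcript, transcribe):
    """Return the candidate audio window whose ASR transcription is most
    similar to the transcript given in MELD, plus its similarity score."""
    best_window, best_score = None, -1.0
    for window in candidate_windows:
        hypothesis = transcribe(window)  # ASR text for this candidate segment
        score = difflib.SequenceMatcher(
            None, hypothesis.lower(), target_transcript.lower()
        ).ratio()
        if score > best_score:
            best_window, best_score = window, score
    return best_window, best_score
```

In such a pipeline, an active speaker detection model would additionally localise which visible face is speaking within the selected window, so that facial expressions can be extracted from the correct person.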