Multimodal emotion recognition is an active research topic in artificial intelligence. It aims to integrate multimodal cues (including acoustic, visual, and lexical cues) and recognize human emotional states from them. Current works generally assume that the emotion labels in benchmark datasets are correct and focus on building more effective architectures to achieve better performance. However, due to the ambiguity and subjectivity of emotion, existing datasets cannot achieve high annotation consistency (i.e., labels may be inaccurate), making it difficult for models developed on them to meet the demands of practical applications. The key to addressing this problem is to improve the reliability of emotion annotations. Therefore, we propose a new task called ``Explainable Multimodal Emotion Reasoning (EMER)''. Unlike previous works that only predict emotional states, EMER further explains the reasons behind these predictions to enhance their reliability. In this task, rationality is the sole evaluation criterion: as long as the emotional reasoning process for a given video is plausible, the prediction is considered correct. In this paper, we make an initial attempt at this task and establish a benchmark dataset, baselines, and evaluation metrics. We aim to address the long-standing problem of label ambiguity and point the way toward next-generation affective computing techniques. In addition, EMER can also be used to evaluate the audio-video-text understanding ability of recent multimodal large language models. Code and data: https://github.com/zeroQiaoba/Explainable-Multimodal-Emotion-Reasoning.