Automatic emotion recognition is a central concern of the Human-Computer Interaction field, as it can bridge the gap between humans and machines. Current works train deep learning models on low-level data representations to solve the emotion recognition task. Since emotion datasets often contain limited amounts of data, these approaches risk overfitting and may learn from superficial cues. To address these issues, we propose a novel cross-representation speech model, inspired by disentangled representation learning, that performs emotion recognition on wav2vec 2.0 speech features. We also train a CNN-based model to recognize emotions from text features extracted with Transformer-based models. We further combine the speech-based and text-based results with a score-fusion approach. Our method is evaluated on the IEMOCAP dataset in a 4-class classification problem, and it surpasses current works on speech-only, text-only, and multimodal emotion recognition.
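For concreteness, the sketch below shows how frame-level wav2vec 2.0 features can be extracted with the Hugging Face `transformers` library. The checkpoint name (`facebook/wav2vec2-base`) and the dummy waveform are illustrative assumptions; the abstract does not specify which wav2vec 2.0 checkpoint or preprocessing the authors use.

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Assumed checkpoint; the paper only states that wav2vec 2.0 features are used.
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
model.eval()

# Placeholder for one second of 16 kHz mono audio; in practice this would be
# an IEMOCAP utterance waveform.
waveform = torch.randn(16000)

inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    # last_hidden_state has shape (batch, frames, 768) for the base model;
    # these frame-level vectors are the speech features fed to the
    # downstream emotion recognition model.
    features = model(**inputs).last_hidden_state
```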
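The score-fusion step can likewise be illustrated with a minimal sketch. The weighting scheme, the `alpha` value, and the 4-class ordering are assumptions for illustration only; the abstract states that speech-based and text-based scores are fused but does not give the exact fusion rule.

```python
import numpy as np

# Hypothetical per-utterance class posteriors from the two unimodal models
# (4 IEMOCAP classes, e.g., angry, happy, neutral, sad).
speech_scores = np.array([0.10, 0.55, 0.20, 0.15])
text_scores = np.array([0.05, 0.30, 0.50, 0.15])

def score_fusion(p_speech, p_text, alpha=0.5):
    """Late (score-level) fusion: weighted average of the two posterior
    distributions, renormalized to sum to one. alpha is an assumed weight."""
    fused = alpha * p_speech + (1.0 - alpha) * p_text
    return fused / fused.sum()

fused = score_fusion(speech_scores, text_scores)
predicted_class = int(np.argmax(fused))  # index of the fused prediction
```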