Speech emotion recognition is a challenging task and an important step towards more natural human-computer interaction (HCI). A popular approach is multimodal emotion recognition based on model-level fusion, in which each modality is encoded into an embedding and the embeddings are concatenated for the final classification. However, due to noise and other factors, the individual modalities do not always point to the same emotional category, which degrades the generalization of the model. In this paper, we propose a novel regularization method based on contrastive learning for multimodal emotion recognition using audio and text. By introducing a discriminator that distinguishes pairs with the same emotion from pairs with different emotions, we explicitly constrain the latent code of each modality to carry the same emotional information, thereby reducing noise interference and yielding more discriminative representations. Experiments are performed on the standard IEMOCAP dataset for 4-class emotion recognition. The results show significant improvements of 1.44\% and 1.53\% in terms of weighted accuracy (WA) and unweighted accuracy (UA), respectively, compared to the baseline system.
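To make the fusion and regularization concrete, the following is a minimal sketch, not the paper's exact architecture: audio and text encoders produce latent codes that are concatenated for model-level fusion, while a small discriminator is trained to tell same-emotion pairs from different-emotion pairs and its loss is added as a contrastive regularizer. All module names, dimensions, the negative-sampling scheme, and the weight `lambda_reg` are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionClassifier(nn.Module):
    """Model-level fusion: per-modality encoders, concatenated embeddings, shared classifier."""
    def __init__(self, audio_dim=128, text_dim=128, num_classes=4):
        super().__init__()
        # Stand-ins for the per-modality encoders (e.g., RNN/Transformer stacks in practice).
        self.audio_enc = nn.Linear(audio_dim, 128)
        self.text_enc = nn.Linear(text_dim, 128)
        # Classifier over the concatenated (fused) embeddings.
        self.classifier = nn.Linear(256, num_classes)
        # Discriminator: judges whether an (audio, text) latent pair carries the same emotion.
        self.discriminator = nn.Sequential(
            nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 1)
        )

    def forward(self, audio_feat, text_feat):
        a = self.audio_enc(audio_feat)   # audio latent code
        t = self.text_enc(text_feat)     # text latent code
        logits = self.classifier(torch.cat([a, t], dim=-1))
        return logits, a, t

def training_step(model, audio_feat, text_feat, labels, lambda_reg=0.1):
    """One training step: classification loss plus discriminator-based contrastive regularizer."""
    logits, a, t = model(audio_feat, text_feat)
    ce_loss = F.cross_entropy(logits, labels)

    # Positive pairs: audio/text latents from the same utterance (same emotion).
    pos = model.discriminator(torch.cat([a, t], dim=-1))
    pos_loss = F.binary_cross_entropy_with_logits(pos, torch.ones_like(pos))

    # Negative pairs: audio latents paired with in-batch shuffled text latents,
    # keeping only pairs whose emotion labels actually differ.
    perm = torch.randperm(a.size(0))
    mask = (labels != labels[perm]).float().unsqueeze(-1)
    neg = model.discriminator(torch.cat([a, t[perm]], dim=-1))
    neg_loss = F.binary_cross_entropy_with_logits(neg, torch.zeros_like(neg), reduction="none")
    neg_loss = (neg_loss * mask).sum() / mask.sum().clamp(min=1)

    # The regularizer pulls same-emotion latents together and pushes different-emotion latents apart.
    return ce_loss + lambda_reg * (pos_loss + neg_loss)
```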