Speech emotion recognition (SER) classifies audio into emotion categories such as Happy, Angry, Fear, Disgust, and Neutral. While SER is a common application for popular languages, it remains a challenge for low-resourced languages, i.e., languages with no pretrained speech-to-text models. This paper first proposes a language-specific model that extracts emotional information from multiple pretrained speech models, and then designs a multi-domain model that performs SER for several languages simultaneously. Our multi-domain model employs a multi-gating mechanism to generate a unique weighted feature combination for each language, and searches for a language-specific neural network structure through a neural architecture search module. In addition, we introduce a contrastive auxiliary loss to build more separable representations of the audio data. Our experiments show that our model raises the state-of-the-art accuracy by 3% for German and 14.3% for French.
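The multi-gating idea mentioned above can be sketched as a per-language softmax gate that weights the feature vectors produced by several pretrained speech models. This is a minimal illustration under our own assumptions: the function name `multi_gate`, the plain softmax, and the weighted-sum fusion are simplifications, not the paper's exact formulation.

```python
import numpy as np

def multi_gate(features, gate_logits):
    """Combine features from several pretrained speech models with a
    softmax gate (hypothetical sketch of a multi-gating mechanism).

    features:    list of per-model feature vectors, shape (num_models, dim)
    gate_logits: one learned logit per model for the current language
    """
    logits = np.asarray(gate_logits, dtype=float)
    w = np.exp(logits - logits.max())   # numerically stable softmax
    w /= w.sum()
    feats = np.asarray(features, dtype=float)
    return w @ feats                    # weighted combination, shape (dim,)
```

With equal logits the gate reduces to a simple average of the model features; in a trained system each language would learn its own logits, shifting weight toward the pretrained models most informative for that language.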