Emotion recognition datasets are relatively small, making the use of more sophisticated deep learning approaches challenging. In this work, we propose a transfer learning method for speech emotion recognition in which features extracted from pre-trained wav2vec 2.0 models are modeled using simple neural networks. We propose to combine the outputs of several layers from the pre-trained model using trainable weights that are learned jointly with the downstream model. Further, we compare performance using two different wav2vec 2.0 models, with and without fine-tuning for speech recognition. We evaluate our proposed approaches on two standard emotion databases, IEMOCAP and RAVDESS, showing superior performance compared to results in the literature.
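A minimal sketch of the trainable layer-combination idea described above, not the authors' exact code: hidden states from every transformer layer of a frozen pre-trained wav2vec 2.0 model are combined with softmax-normalized weights learned jointly with a small downstream classifier. The checkpoint name `facebook/wav2vec2-base`, the mean pooling over time, and the classifier sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class WeightedLayerSER(nn.Module):
    def __init__(self, num_emotions: int = 4,
                 pretrained: str = "facebook/wav2vec2-base"):  # assumed checkpoint
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained(pretrained)
        self.encoder.requires_grad_(False)  # features are extracted, not fine-tuned
        # +1 because hidden_states also includes the CNN feature-encoder output
        n_layers = self.encoder.config.num_hidden_layers + 1
        self.layer_weights = nn.Parameter(torch.zeros(n_layers))  # trainable weights
        hidden = self.encoder.config.hidden_size
        # "simple neural network" downstream model (sizes are illustrative)
        self.classifier = nn.Sequential(
            nn.Linear(hidden, 128), nn.ReLU(), nn.Linear(128, num_emotions))

    def forward(self, input_values: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            out = self.encoder(input_values, output_hidden_states=True)
        # Stack all layer outputs: (n_layers, batch, time, hidden)
        h = torch.stack(out.hidden_states, dim=0)
        w = torch.softmax(self.layer_weights, dim=0).view(-1, 1, 1, 1)
        pooled = (w * h).sum(dim=0).mean(dim=1)  # weighted sum, then mean over time
        return self.classifier(pooled)

model = WeightedLayerSER()
logits = model(torch.randn(2, 16000))  # two 1-second clips at 16 kHz
```

Only `layer_weights` and the classifier receive gradients here, matching the setup in which the pre-trained model serves purely as a feature extractor; the softmax keeps the layer weights normalized so the learned combination stays a convex mixture of layers.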