We propose a method for emotion recognition through emotion-dependent speech recognition using Wav2vec 2.0. Our method achieves a significant improvement over most previously reported results on IEMOCAP, a benchmark emotion dataset. Different types of phonetic units are employed and compared in terms of the accuracy and robustness of emotion recognition within and across datasets and languages. Models based on phonemes, broad phonetic classes, and syllables all significantly outperform the utterance-level model, demonstrating that phonetic units are helpful and should be incorporated into speech emotion recognition. The best performance comes from using broad phonetic classes. Further research is needed to determine the optimal set of broad phonetic classes for emotion recognition. Finally, we found that Wav2vec 2.0 can be fine-tuned to recognize phonetic units that are coarser-grained or larger than phonemes, such as broad phonetic classes and syllables.
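As a concrete illustration of the last point, the sketch below (not the authors' code) fine-tunes a pretrained Wav2vec 2.0 encoder with a CTC head whose output vocabulary is a small set of broad phonetic classes rather than phonemes, using the Hugging Face transformers API. The class inventory, checkpoint name, and label sequence are illustrative assumptions, not values from the paper.

```python
import numpy as np
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForCTC

# Hypothetical inventory of broad phonetic classes (an assumption, not the
# paper's set): index 0 doubles as the CTC blank/pad, followed by vowel,
# stop, fricative, nasal, glide, affricate, and silence.
classes = ["<pad>", "V", "S", "F", "N", "G", "A", "SIL"]

feature_extractor = Wav2Vec2FeatureExtractor(
    feature_size=1, sampling_rate=16000, padding_value=0.0, do_normalize=True
)

# Load the pretrained encoder and attach a freshly initialized CTC head
# sized to the small broad-class vocabulary instead of a phoneme inventory.
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-base",
    vocab_size=len(classes),
    pad_token_id=0,
    ctc_loss_reduction="mean",
)

# One toy gradient step on 1 s of random 16 kHz "audio" with a made-up
# broad-class target sequence: stop, vowel, nasal, fricative, vowel.
audio = np.random.randn(16000).astype(np.float32)
inputs = feature_extractor(audio, sampling_rate=16000, return_tensors="pt")
labels = torch.tensor([[2, 1, 4, 3, 1]])  # indices into `classes`

loss = model(input_values=inputs.input_values, labels=labels).loss
loss.backward()
```

Because the CTC head is just a linear projection over the encoder's frame representations, swapping the phoneme vocabulary for broad classes or syllables only changes the size of that projection; the same recipe applies to any coarser-grained unit set.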