Despite advances in deep learning, state-of-the-art speech emotion recognition (SER) systems still perform poorly because of the scarcity of labeled emotional speech datasets. This paper proposes augmenting SER systems with synthetic emotional speech generated by an end-to-end text-to-speech (TTS) system based on an extended Tacotron architecture. The proposed TTS system includes encoders for speaker and emotion embeddings, a sequence-to-sequence generator that converts text into Mel-spectrograms, and a WaveRNN vocoder that synthesizes audio from the Mel-spectrograms. Extensive experiments show that the generated emotional speech is of high quality, achieving a higher mean opinion score (MOS) than the baseline, and that augmenting training data with the generated samples significantly improves SER performance on multiple datasets.
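To make the conditioning scheme concrete, below is a minimal PyTorch sketch of how speaker and emotion embeddings can condition a Tacotron-style text-to-mel generator. The module names, dimensions, and the plain GRU decoder are illustrative assumptions rather than the paper's exact architecture; the actual system uses attention-based sequence-to-sequence decoding and a WaveRNN vocoder to turn Mel-spectrograms into audio.

```python
# Illustrative sketch: speaker/emotion-conditioned text-to-mel generation.
# All dimensions and module choices are assumptions for demonstration only.
import torch
import torch.nn as nn

class ConditionedTextToMel(nn.Module):
    def __init__(self, vocab_size=100, n_speakers=10, n_emotions=5,
                 text_dim=256, cond_dim=64, mel_dim=80):
        super().__init__()
        self.text_emb = nn.Embedding(vocab_size, text_dim)
        self.text_encoder = nn.GRU(text_dim, text_dim, batch_first=True,
                                   bidirectional=True)
        # Speaker and emotion encoders reduced to lookup tables for this sketch.
        self.speaker_emb = nn.Embedding(n_speakers, cond_dim)
        self.emotion_emb = nn.Embedding(n_emotions, cond_dim)
        # Decoder consumes the conditioned encoder states and emits mel frames.
        self.decoder = nn.GRU(2 * text_dim + 2 * cond_dim, 512, batch_first=True)
        self.mel_proj = nn.Linear(512, mel_dim)

    def forward(self, text_ids, speaker_id, emotion_id):
        enc, _ = self.text_encoder(self.text_emb(text_ids))       # (B, T, 2*text_dim)
        cond = torch.cat([self.speaker_emb(speaker_id),
                          self.emotion_emb(emotion_id)], dim=-1)  # (B, 2*cond_dim)
        cond = cond.unsqueeze(1).expand(-1, enc.size(1), -1)      # broadcast over time
        dec, _ = self.decoder(torch.cat([enc, cond], dim=-1))
        return self.mel_proj(dec)                                 # (B, T, mel_dim)

model = ConditionedTextToMel()
mel = model(torch.randint(0, 100, (2, 40)),   # token ids
            torch.tensor([0, 1]),             # speaker ids
            torch.tensor([2, 4]))             # emotion ids
print(mel.shape)  # torch.Size([2, 40, 80]); a vocoder such as WaveRNN maps mel to waveform
```

The predicted Mel-spectrograms could then be passed to a neural vocoder and the resulting waveforms added to the SER training set as augmentation data.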