Alongside acoustic information, linguistic features based on speech transcripts have proven useful in Speech Emotion Recognition (SER). However, due to the scarcity of emotion-labelled data and the difficulty of recognizing emotional speech, it is hard to obtain reliable linguistic features and models in this research area. In this paper, we propose fusing Automatic Speech Recognition (ASR) outputs into the SER pipeline and training the two tasks jointly. The relationship between ASR and SER is understudied, and it remains unclear which ASR features benefit SER and how. By examining various ASR outputs and fusion methods, our experiments show that in joint ASR-SER training, incorporating both ASR hidden representations and ASR text output through a hierarchical co-attention fusion approach improves SER performance the most. On the IEMOCAP corpus, our approach achieves 63.4% weighted accuracy, which is close to the baseline result obtained with ground-truth transcripts. In addition, we present a novel word error rate analysis on IEMOCAP and a layer-difference analysis of the Wav2vec 2.0 model to better understand the relationship between ASR and SER.
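To make the fusion idea concrete, below is a minimal sketch of hierarchical co-attention fusion over ASR hidden states and ASR text-output embeddings. This is not the paper's implementation: the module names, feature dimensions, pooling, and the two-stage attention ordering (acoustic attends to ASR hidden states first, then to text embeddings) are illustrative assumptions.

```python
# Hypothetical sketch of hierarchical co-attention fusion for joint
# ASR-SER training. Dimensions, staging, and pooling are assumptions,
# not the authors' exact architecture.
import torch
import torch.nn as nn


class CoAttentionFusion(nn.Module):
    def __init__(self, dim: int = 256, n_heads: int = 4, n_classes: int = 4):
        super().__init__()
        # Stage 1: acoustic features attend to ASR encoder hidden states.
        self.attn_hidden = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        # Stage 2: the fused stream attends to ASR text-output embeddings.
        self.attn_text = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.classifier = nn.Linear(dim, n_classes)

    def forward(self, acoustic, asr_hidden, asr_text):
        # acoustic:   (B, T_a, dim) frame-level acoustic features
        # asr_hidden: (B, T_h, dim) projected ASR hidden representations
        # asr_text:   (B, T_t, dim) embeddings of the ASR text output
        h, _ = self.attn_hidden(acoustic, asr_hidden, asr_hidden)
        h, _ = self.attn_text(h, asr_text, asr_text)
        # Mean-pool over time, then predict the emotion class.
        return self.classifier(h.mean(dim=1))


# Usage with random tensors standing in for real features:
model = CoAttentionFusion()
logits = model(torch.randn(2, 100, 256),   # acoustic features
               torch.randn(2, 100, 256),   # ASR hidden states
               torch.randn(2, 20, 256))    # ASR text embeddings
print(logits.shape)  # torch.Size([2, 4])
```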