Large, pre-trained neural networks consisting of self-attention layers (transformers) have recently achieved state-of-the-art results on several speech emotion recognition (SER) datasets. These models are typically pre-trained in a self-supervised manner with the goal of improving automatic speech recognition performance -- and thus, of understanding linguistic information. In this work, we investigate the extent to which this information is exploited during SER fine-tuning. Using a reproducible methodology based on open-source tools, we synthesise prosodically neutral speech utterances while varying the sentiment of the text. Valence predictions of the transformer model are very reactive to positive and negative sentiment content, as well as to negations, but not to intensifiers or reducers, while none of these linguistic features impacts arousal or dominance. These findings show that transformers can successfully leverage linguistic information to improve their valence predictions, and that linguistic analysis should be included in their testing.
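The probing setup described above can be reproduced in outline with open-source tools. The following is a minimal sketch, assuming a Coqui TTS single-speaker voice for prosodically flat synthesis and a hypothetical dimensional-SER checkpoint on the Hugging Face Hub (the checkpoint name `some-org/wav2vec2-ser-dimensional` is a placeholder, not the model used in the paper):

```python
# Minimal sketch: synthesise text probes with (roughly) constant prosody,
# then read off the dimensional emotion predictions of a fine-tuned
# SER transformer. Checkpoint name below is a placeholder assumption.
import librosa
import torch
from TTS.api import TTS
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification

# Same carrier phrase with varied sentiment words, negation, and an intensifier.
probes = {
    "positive":    "This is a wonderful day.",
    "negative":    "This is a terrible day.",
    "negated":     "This is not a wonderful day.",
    "intensified": "This is a very wonderful day.",
}

# Single-speaker TTS keeps prosody roughly constant while only wording changes.
tts = TTS("tts_models/en/ljspeech/tacotron2-DDC")
for name, text in probes.items():
    tts.tts_to_file(text=text, file_path=f"{name}.wav")

# Hypothetical SER checkpoint predicting arousal/dominance/valence;
# substitute the actual fine-tuned transformer under test.
ckpt = "some-org/wav2vec2-ser-dimensional"  # placeholder name
extractor = AutoFeatureExtractor.from_pretrained(ckpt)
model = AutoModelForAudioClassification.from_pretrained(ckpt).eval()

for name in probes:
    wav, sr = librosa.load(f"{name}.wav", sr=16000)
    inputs = extractor(wav, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        scores = model(**inputs).logits.squeeze()
    print(name, scores.tolist())  # compare how valence shifts across probes
```

Comparing the printed scores across probes indicates whether predictions move with the sentiment of the text (as reported for valence) or stay flat (as reported for arousal and dominance).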