In human-computer interaction, Speech Emotion Recognition (SER) plays an essential role in understanding a user's intent and improving the interactive experience. Although utterances expressing similar emotions vary in speaker characteristics, they share common antecedents and consequences; a key challenge for SER is therefore to produce robust and discriminative representations that capture the causality underlying speech emotions. In this paper, we propose a Gated Multi-scale Temporal Convolutional Network (GM-TCNet), which constructs a novel emotional-causality representation learning component with a multi-scale receptive field. This component, built from dilated causal convolution layers and a gating mechanism, captures the dynamics of emotion across the time domain. In addition, GM-TCNet uses skip connections to fuse high-level features from different gated convolution blocks, capturing the abundant and subtle emotion changes in human speech. GM-TCNet takes a single type of feature, mel-frequency cepstral coefficients (MFCCs), as input and passes it through the gated temporal convolutional module to generate high-level features, which are finally fed to an emotion classifier to accomplish the SER task. Experimental results show that our model achieves the highest performance in most cases compared with state-of-the-art techniques.
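The core building block described above, a dilated causal convolution modulated by a gating mechanism, can be illustrated with a minimal numpy sketch. This is not the authors' implementation; the function names, shapes, and the tanh/sigmoid gating form (as popularized by WaveNet-style models) are assumptions for illustration only.

```python
import numpy as np

def causal_dilated_conv1d(x, w, dilation):
    """Causal dilated 1-D convolution.

    x: (T, C_in) input sequence (e.g., MFCC frames over time)
    w: (K, C_in, C_out) kernel; output at step t only sees steps <= t.
    """
    K = w.shape[0]
    pad = (K - 1) * dilation
    xp = np.pad(x, ((pad, 0), (0, 0)))  # left-pad so the receptive field is causal
    T, C_out = x.shape[0], w.shape[2]
    out = np.zeros((T, C_out))
    for t in range(T):
        for k in range(K):
            # tap at t - k*dilation in the original time axis
            out[t] += xp[t + pad - k * dilation] @ w[K - 1 - k]
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_block(x, w_filter, w_gate, dilation):
    """One gated causal convolution block: tanh filter modulated by a sigmoid gate.

    Stacking such blocks with growing dilation (1, 2, 4, ...) yields the
    multi-scale receptive field the abstract refers to.
    """
    f = np.tanh(causal_dilated_conv1d(x, w_filter, dilation))
    g = sigmoid(causal_dilated_conv1d(x, w_gate, dilation))
    return f * g
```

A quick property check: because the convolution is causal, perturbing future frames leaves earlier outputs untouched, which is what lets the block model the temporal "cause precedes effect" structure of emotion dynamics.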