Speech emotion recognition (SER) plays a vital role in improving interactions between humans and machines by inferring human emotion and affective states from speech signals. While recent works primarily focus on mining spatiotemporal information from hand-crafted features, we explore how to model the temporal patterns of speech emotions at dynamic temporal scales. Toward that goal, we introduce a novel temporal emotional modeling approach for SER, termed the Temporal-aware bI-direction Multi-scale Network (TIM-Net), which learns multi-scale contextual affective representations from various time scales. Specifically, TIM-Net first employs temporal-aware blocks to learn temporal affective representations, then integrates complementary information from the past and the future to enrich the contextual representations, and finally fuses features from multiple time scales to better adapt to emotional variation. Extensive experimental results on six benchmark SER datasets demonstrate the superior performance of TIM-Net, with average improvements of 2.34% in UAR and 2.61% in WAR over the second-best method on each corpus. The source code is available at https://github.com/Jiaxin-Ye/TIM-Net_SER.
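The three steps named above (causal temporal-aware blocks, bi-directional past/future integration, and multi-scale fusion) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (that lives at the GitHub URL above); the function names, the moving-average stand-in for a learned dilated convolution, and the concatenation-based fusion are all simplifying assumptions made for illustration only.

```python
import numpy as np

def temporal_block(x, dilation):
    """Causal dilated smoothing over time, a toy stand-in for a learned
    temporal-aware block: frame t sees only frames t and t - dilation."""
    out = np.zeros_like(x)
    for t in range(len(x)):
        past = x[t - dilation] if t - dilation >= 0 else 0.0
        out[t] = np.tanh(0.5 * (x[t] + past))  # causal, bounded activation
    return out

def tim_net_sketch(features, num_scales=3):
    """Bi-directional multi-scale temporal modeling (illustrative only):
    stack dilated blocks on the signal and on its time reversal, pool a
    representation at every scale, and fuse the scales at the end."""
    scale_feats = []
    fwd, bwd = features, features[::-1]          # forward and backward views
    for s in range(num_scales):
        dilation = 2 ** s                        # receptive field grows per scale
        fwd = temporal_block(fwd, dilation)
        bwd = temporal_block(bwd, dilation)
        merged = fwd + bwd[::-1]                 # integrate past + future context
        scale_feats.append(merged.mean(axis=0))  # temporal pooling at this scale
    return np.concatenate(scale_feats)           # fuse multiple time scales

rng = np.random.default_rng(0)
mfcc = rng.standard_normal((100, 39))  # 100 frames of 39-dim MFCC-like features
emb = tim_net_sketch(mfcc)
print(emb.shape)  # (117,) = 3 scales x 39 dims
```

In the actual TIM-Net the blocks are learned dilated causal convolutions and the multi-scale fusion is dynamic rather than a plain concatenation, but the sketch shows the data flow: each scale widens the temporal context, and both temporal directions contribute before fusion.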