Speech emotion recognition (SER) is pivotal for enhancing human-machine interaction. This paper introduces "EmoHRNet", a novel adaptation of High-Resolution Networks (HRNet) tailored for SER. Audio samples are first transformed into spectrograms, from which EmoHRNet extracts high-level features using the HRNet architecture. Because HRNet maintains high-resolution representations from the initial to the final layers, EmoHRNet captures both fine-grained and overarching emotional cues in speech signals. The model outperforms leading models, achieving accuracies of 92.45% on RAVDESS, 80.06% on IEMOCAP, and 92.77% on EMOVO. These results show that EmoHRNet sets a new benchmark in the SER domain.
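The abstract states that audio samples are transformed into spectrograms before being passed to the network, but it does not give the transform's parameters. The following is a minimal sketch of one common way to do this, a log-magnitude short-time Fourier transform written with NumPy. The function name `log_spectrogram`, the frame length, the hop size, the Hann window, and the synthetic test tone are all illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def log_spectrogram(signal, frame_len=256, hop=128, eps=1e-10):
    """Return a log-magnitude spectrogram of shape (freq_bins, frames).

    Hypothetical helper: frame_len, hop, and the Hann window are
    assumed values, not the settings used by EmoHRNet.
    """
    signal = np.asarray(signal, dtype=np.float64)
    if signal.size < frame_len:
        raise ValueError("signal is shorter than one frame")

    window = np.hanning(frame_len)
    n_frames = 1 + (signal.size - frame_len) // hop

    # Slice the signal into overlapping, windowed frames.
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )

    # Magnitude of the one-sided FFT of each frame: (frames, freq_bins).
    magnitude = np.abs(np.fft.rfft(frames, axis=1))

    # Log compression; eps avoids log(0). Transpose to (freq_bins, frames).
    return np.log(magnitude + eps).T


# Synthetic stand-in for a speech clip: one second of a 1 kHz tone at 16 kHz.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 1000 * t)

spec = log_spectrogram(tone)
print(spec.shape)
```

The resulting two-dimensional array can be treated as a single-channel image, which is the kind of input an image-style backbone such as HRNet consumes. A mel-scaled spectrogram is another frequently used choice in SER and would follow the same framing and windowing steps.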