Despite significant progress, speech emotion recognition (SER) remains challenging due to the inherent complexity and ambiguity of emotion, particularly in in-the-wild conditions. Whereas current studies primarily focus on recognition and generalization abilities, this work pioneers an investigation into the reliability of SER methods and explores the modeling of speech emotion based on data distribution across various speech attributes. Specifically, a novel CNN-based SER model that adopts additive margin softmax loss is first designed. Second, a novel multiple speech attribute control method, MSAC, is proposed to explicitly control speech attributes, making the model less affected by emotion-agnostic features and enabling it to extract fine-grained emotion-related representations. Third, we make a first attempt to examine the reliability of the proposed unified SER workflow using out-of-distribution detection. Experiments on both single-corpus and cross-corpus SER scenarios show that the proposed unified SER workflow consistently outperforms the baseline in all aspects. Remarkably, in single-corpus SER, the proposed workflow achieves superior recognition results, with a weighted average recall (WAR) of 72.97% and an unweighted average recall (UAR) of 71.76% on the IEMOCAP corpus.
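
For readers unfamiliar with the additive margin softmax loss mentioned above, the following is a minimal single-sample sketch in plain Python, not the authors' implementation. It uses the standard formulation, in which the embedding and class weight vectors are L2-normalized, a margin is subtracted from the target-class cosine, and the result is scaled before a softmax cross-entropy. The function name `am_softmax_loss` and the default values `scale=30.0` and `margin=0.35` are illustrative assumptions; the abstract does not state the settings used in this work.

```python
import math


def am_softmax_loss(embedding, class_weights, target, scale=30.0, margin=0.35):
    """Additive margin softmax loss for one sample (illustrative sketch).

    embedding:     feature vector of the sample (list of floats, non-zero)
    class_weights: one weight vector per class (list of non-zero lists)
    target:        index of the true class
    scale, margin: hypothetical defaults, not values taken from the paper
    """

    def l2_normalize(v):
        # Assumes v is not the zero vector.
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    e = l2_normalize(embedding)

    # Cosine similarity between the embedding and each class weight vector.
    cosines = [
        sum(a * b for a, b in zip(e, l2_normalize(w))) for w in class_weights
    ]

    # Subtract the margin from the target-class cosine only, then scale.
    logits = [
        scale * (c - margin) if j == target else scale * c
        for j, c in enumerate(cosines)
    ]

    # Numerically stable cross-entropy: log-sum-exp minus the target logit.
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(
        sum(math.exp(z - max_logit) for z in logits)
    )
    return log_sum_exp - logits[target]
```

With `margin=0` the function reduces to ordinary softmax cross-entropy on scaled cosine similarities; a positive margin raises the loss for the same input, which pushes the model toward larger angular separation between emotion classes.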