Speech Self-Supervised Learning (SSL) has demonstrated considerable efficacy in various downstream tasks. Nevertheless, prevailing self-supervised models often overlook the incorporation of emotion-related prior information, thereby neglecting the potential enhancement of emotion task comprehension through emotion prior knowledge in speech. In this paper, we propose an emotion-aware speech representation learning with intensity knowledge. Specifically, we extract frame-level emotion intensities using an established speech-emotion understanding model. Subsequently, we propose a novel emotional masking strategy (EMS) to incorporate emotion intensities into the masking process. We selected two representative models based on Transformer and CNN, namely MockingJay and Non-autoregressive Predictive Coding (NPC), and conducted experiments on IEMOCAP dataset. Experiments have demonstrated that the representations derived from our proposed method outperform the original model in SER task.
翻译:暂无翻译