Contrastive learning based cross-modality pretraining methods have recently achieved impressive success across diverse fields. In this paper, we propose GEmo-CLAP, a gender-attribute-enhanced contrastive language-audio pretraining (CLAP) method for speech emotion recognition. Specifically, we first build a novel emotion CLAP model (Emo-CLAP) on top of various self-supervised pre-trained models. Second, given the importance of the gender attribute in speech emotion modeling, we further propose the soft-label-based GEmo-CLAP (SL-GEmo-CLAP) and the multi-task-learning-based GEmo-CLAP (ML-GEmo-CLAP), which integrate the emotion and gender information of speech signals to form more reasonable training objectives. Extensive experiments on IEMOCAP show that both proposed GEmo-CLAP models consistently outperform the Emo-CLAP baseline across different pre-trained models, while also achieving the best recognition performance compared with recent state-of-the-art methods. Notably, the WavLM-based ML-GEmo-CLAP achieves the best UAR of 80.16% and WAR of 82.06%.
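For illustration, CLAP-style pretraining of the kind the abstract describes pairs audio and text embeddings with a symmetric contrastive objective: matched audio-text pairs form the diagonal of a temperature-scaled similarity matrix, and cross-entropy is applied in both directions. The sketch below is a minimal NumPy stand-in, not the authors' implementation; the function name, temperature value, and embedding shapes are illustrative assumptions.

```python
# Minimal sketch of a CLAP-style symmetric contrastive loss (illustrative,
# not the paper's exact objective). Paired rows of audio_emb and text_emb
# are treated as positives; all other rows in the batch are negatives.
import numpy as np

def clap_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    # L2-normalize both modalities so the dot product is cosine similarity.
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = a @ t.T / temperature  # (batch, batch); positives on the diagonal

    def cross_entropy_diag(l):
        # Numerically stable log-softmax per row; target is the diagonal entry.
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Symmetric loss: audio-to-text and text-to-audio directions.
    return (cross_entropy_diag(logits) + cross_entropy_diag(logits.T)) / 2
```

A sanity check of the behavior: when audio and text embeddings of each pair coincide, the diagonal dominates and the loss is near zero; for unrelated embeddings it approaches log(batch_size).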