Contrastive cross-modality pretraining approaches have recently exhibited impressive success in diverse fields. In this paper, we propose GEmo-CLAP, a gender-attribute-enhanced contrastive language-audio pretraining (CLAP) method for speech emotion recognition (SER). Specifically, we first build an effective emotion CLAP model (Emo-CLAP) for SER using various self-supervised pre-trained models. Second, given the significance of the gender attribute in speech emotion modeling, we further propose two novel variants, a soft-label-based GEmo-CLAP (SL-GEmo-CLAP) and a multi-task-learning-based GEmo-CLAP (ML-GEmo-CLAP), which incorporate the gender information of speech signals to form more reasonable training objectives. Experiments on IEMOCAP demonstrate that both proposed GEmo-CLAP models consistently outperform the baseline Emo-CLAP across various pre-trained models, while also achieving the best recognition performance compared with state-of-the-art SER methods. Notably, the proposed WavLM-based SL-GEmo-CLAP achieves the best UAR of 81.43\% and WAR of 83.16\%.
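Since the abstract only outlines the soft-label idea, the following is a rough, illustrative Python sketch of how a gender-augmented soft-target contrastive objective could look; it is not the authors' implementation, and the function name, the `gender_weight` blending factor, and the batch-level label-agreement targets are assumptions made for illustration.

```python
# Hypothetical sketch of a CLAP-style contrastive loss with gender-augmented
# soft targets, loosely following the abstract's description. All names and
# the blending scheme are illustrative assumptions, not the paper's method.
import torch
import torch.nn.functional as F

def soft_label_clap_loss(audio_emb, text_emb, emotion_ids, gender_ids,
                         temperature=0.07, gender_weight=0.3):
    """Contrastive loss whose target matrix softly blends emotion-label
    agreement with gender-label agreement within a batch."""
    # L2-normalise embeddings so the dot product is a cosine similarity.
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise audio-to-text similarity logits, scaled by a temperature.
    logits = audio_emb @ text_emb.t() / temperature

    # Hard agreement matrices: 1 where two samples share the same label.
    emo_match = (emotion_ids.unsqueeze(0) == emotion_ids.unsqueeze(1)).float()
    gen_match = (gender_ids.unsqueeze(0) == gender_ids.unsqueeze(1)).float()

    # Soft targets: mostly emotion agreement, partly gender agreement,
    # normalised so each row is a probability distribution.
    soft = (1 - gender_weight) * emo_match + gender_weight * gen_match
    targets = soft / soft.sum(dim=-1, keepdim=True)

    # Symmetric KL divergence between predicted and target distributions
    # (the target matrix is symmetric, so it serves both directions).
    loss_a2t = F.kl_div(F.log_softmax(logits, dim=-1), targets, reduction="batchmean")
    loss_t2a = F.kl_div(F.log_softmax(logits.t(), dim=-1), targets, reduction="batchmean")
    return (loss_a2t + loss_t2a) / 2
```

Under these assumptions, setting `gender_weight=0` recovers a plain emotion-only contrastive target, while a nonzero value lets same-gender pairs contribute partial positive mass to the objective.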