We propose a novel multi-task pre-training method for Speech Emotion Recognition (SER). We pre-train the SER model simultaneously on Automatic Speech Recognition (ASR) and sentiment classification tasks to make the acoustic ASR model more ``emotion aware''. We generate targets for sentiment classification using a text-to-sentiment model trained on publicly available data. Finally, we fine-tune the acoustic ASR model on emotion-annotated speech data. We evaluate the proposed approach on the MSP-Podcast dataset, where we achieve the best reported concordance correlation coefficient (CCC) of 0.41 for valence prediction.
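The joint pre-training objective can be pictured as a weighted sum of an ASR loss and a sentiment-classification loss computed from a shared acoustic encoder. The PyTorch sketch below is purely illustrative and is not the authors' implementation: the encoder, feature dimensions, vocabulary size, number of sentiment classes, and loss weight are all assumptions.

```python
import torch
import torch.nn as nn

class MultiTaskPreTrainer(nn.Module):
    """Shared acoustic encoder with two heads: ASR (CTC) and sentiment classification (hypothetical sketch)."""

    def __init__(self, encoder_dim=768, vocab_size=32, num_sentiments=3, sentiment_weight=0.5):
        super().__init__()
        # Placeholder acoustic encoder; in practice this would be the ASR acoustic model.
        self.encoder = nn.GRU(input_size=80, hidden_size=encoder_dim, batch_first=True)
        self.asr_head = nn.Linear(encoder_dim, vocab_size)            # frame-level token logits
        self.sentiment_head = nn.Linear(encoder_dim, num_sentiments)  # utterance-level sentiment logits
        self.ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)
        self.ce_loss = nn.CrossEntropyLoss()
        self.sentiment_weight = sentiment_weight

    def forward(self, feats, feat_lens, tokens, token_lens, sentiment_labels):
        enc, _ = self.encoder(feats)                                   # (B, T, D)
        # ASR branch: CTC over frame-level log-probabilities.
        asr_logp = self.asr_head(enc).log_softmax(-1).transpose(0, 1)  # (T, B, V)
        loss_asr = self.ctc_loss(asr_logp, tokens, feat_lens, token_lens)
        # Sentiment branch: mean-pool encoder states, classify the whole utterance
        # against targets produced by the text-to-sentiment model.
        pooled = enc.mean(dim=1)
        loss_sent = self.ce_loss(self.sentiment_head(pooled), sentiment_labels)
        # Joint pre-training objective: weighted sum of the two task losses.
        return loss_asr + self.sentiment_weight * loss_sent
```

After this pre-training stage, the sentiment head would be discarded or replaced and the shared encoder fine-tuned on the emotion-annotated speech data, as described above.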