Existing task-specific predictive approaches for audio-and-language focus on building complicated late-fusion mechanisms. However, these models are prone to overfitting when labeled data are limited and tend to generalize poorly. In this paper, we present a Cross-modal Transformer for Audio-and-Language, i.e., CTAL, which aims to learn the intra-modality and inter-modality connections between audio and language through two proxy tasks on a large number of audio-and-language pairs: masked language modeling and masked cross-modal acoustic modeling. After fine-tuning our pre-trained model on multiple downstream audio-and-language tasks, we observe significant improvements across various tasks, including emotion classification, sentiment analysis, and speaker verification. On this basis, we further propose a specially-designed fusion mechanism for the fine-tuning phase, which allows our pre-trained model to achieve better performance. Lastly, we present detailed ablation studies showing that both our novel cross-modality fusion component and our audio-language pre-training methods contribute significantly to the promising results.
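To make the two proxy tasks concrete, the following is a minimal sketch of how inputs could be prepared for masked language modeling on text tokens and masked cross-modal acoustic modeling on paired audio frames. This is not the authors' exact implementation; all names (`mask_text_tokens`, `mask_audio_frames`, `MASK_ID`, the mask ratios, and the toy tensor shapes) are illustrative assumptions.

```python
# Hypothetical sketch of input masking for the two pre-training proxy tasks.
# Assumptions: BERT-style [MASK] token id, 15% masking ratios, 80-dim log-mel frames.
import torch

MASK_ID = 103            # assumed [MASK] token id (BERT-style vocabulary)
TEXT_MASK_RATIO = 0.15   # assumed fraction of text tokens to mask
AUDIO_MASK_RATIO = 0.15  # assumed fraction of acoustic frames to mask

def mask_text_tokens(token_ids: torch.Tensor):
    """Randomly replace a fraction of token ids with [MASK]; return masked ids and MLM targets."""
    mask = torch.rand(token_ids.shape) < TEXT_MASK_RATIO
    labels = torch.where(mask, token_ids, torch.full_like(token_ids, -100))  # -100 = ignore index
    masked_ids = torch.where(mask, torch.full_like(token_ids, MASK_ID), token_ids)
    return masked_ids, labels

def mask_audio_frames(frames: torch.Tensor):
    """Zero out a fraction of acoustic frames; the model learns to reconstruct them."""
    frame_mask = torch.rand(frames.shape[:-1]) < AUDIO_MASK_RATIO       # per-frame boolean mask
    masked_frames = frames.masked_fill(frame_mask.unsqueeze(-1), 0.0)   # hide the selected frames
    return masked_frames, frames, frame_mask                            # inputs, targets, mask

# Toy usage: masked text and audio would be fed to the cross-modal transformer;
# text reconstruction typically uses cross-entropy, acoustic reconstruction an
# L1/L2 loss restricted to the masked positions.
token_ids = torch.randint(0, 30000, (2, 32))   # toy batch of token ids
frames = torch.randn(2, 200, 80)               # toy batch of 80-dim acoustic frames
masked_ids, text_labels = mask_text_tokens(token_ids)
masked_frames, frame_targets, frame_mask = mask_audio_frames(frames)
```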