With the prevalence of streaming media platforms serving music search and recommendation, interpreting music by jointly understanding audio and lyrics has become an important and challenging task. However, many previous works focus on refining individual components of the encoder-decoder architecture that maps music to caption tokens, ignoring the potential of audio-lyrics correspondence. In this paper, we propose to explicitly learn this multi-modal alignment with retrieval augmentation through contrastive learning. By learning audio-lyrics correspondence, the model is guided toward better cross-modal attention weights and thus generates higher-quality caption words. We provide both theoretical and empirical results that demonstrate the advantage of the proposed method.
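The abstract does not spell out the training objective, but a common way to learn audio-lyrics correspondence with contrastive learning is a symmetric InfoNCE loss over paired embeddings, where matching audio/lyrics rows in a batch are positives and all other combinations are negatives. The PyTorch sketch below is a minimal illustration under that assumption; the function name, temperature value, and embedding shapes are placeholders of ours, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def audio_lyrics_contrastive_loss(audio_emb, lyrics_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss for paired audio/lyrics embeddings.

    audio_emb, lyrics_emb: (batch, dim) tensors; row i of each tensor is a
    positive pair, and every other in-batch combination serves as a negative.
    """
    # L2-normalize so the dot product becomes a cosine similarity
    audio_emb = F.normalize(audio_emb, dim=-1)
    lyrics_emb = F.normalize(lyrics_emb, dim=-1)

    # (batch, batch) similarity matrix, scaled by the temperature
    logits = audio_emb @ lyrics_emb.t() / temperature
    targets = torch.arange(audio_emb.size(0), device=audio_emb.device)

    # Average the audio-to-lyrics and lyrics-to-audio directions
    loss_a2l = F.cross_entropy(logits, targets)
    loss_l2a = F.cross_entropy(logits.t(), targets)
    return (loss_a2l + loss_l2a) / 2
```

Minimizing such a loss pulls matched audio and lyrics embeddings together in a shared space, which is one plausible mechanism by which the learned correspondence could shape the cross-modal attention used during caption generation.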