This dissertation proposes the study of multimodal learning in the context of musical signals. Throughout, we focus on the interaction between audio signals and text information. Among the many text sources related to music (e.g. reviews, metadata, or social network feedback), we concentrate on lyrics. The singing voice directly connects the audio signal and the text information in a unique way, combining melody and lyrics, where the linguistic dimension complements the abstraction of musical instruments. Our study focuses on the interaction between audio and lyrics, targeting two tasks: source separation and informed content estimation.