Human perception and experience of music are highly context-dependent. This contextual variability contributes to differences in how we interpret and interact with music, challenging the design of robust models for music information retrieval. Incorporating multimodal context from diverse sources is a promising approach toward modeling this variability. Music presented in media such as movies and music videos provides rich multimodal context that modulates the underlying human experience. However, such context modeling remains underexplored, as it requires large amounts of multimodal data along with relevant annotations. Self-supervised learning can help address these challenges by automatically extracting rich, high-level correspondences between different modalities, thereby alleviating the need for fine-grained annotations at scale. In this study, we propose VCMR -- Video-Conditioned Music Representations, a contrastive learning framework that learns music representations from audio and the accompanying music videos. The contextual visual information enhances representations of music audio, as evaluated on the downstream task of music tagging. Experimental results show that the proposed framework can contribute additive robustness to audio representations and indicate to what extent musical elements are affected or determined by visual context.
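As a rough illustration of the kind of cross-modal contrastive objective the abstract describes (the exact VCMR loss, encoders, and hyperparameters are not specified here), the following PyTorch sketch pairs audio and video clip embeddings under a symmetric InfoNCE-style loss; the function name, embedding shapes, and temperature are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def audio_video_contrastive_loss(audio_emb, video_emb, temperature=0.1):
    """Symmetric InfoNCE-style loss between paired audio and video embeddings.

    audio_emb, video_emb: (batch, dim) tensors where row i of each tensor
    is assumed to come from the same music-video clip (a positive pair);
    all other rows in the batch serve as negatives.
    """
    # L2-normalize so the dot product is cosine similarity.
    audio_emb = F.normalize(audio_emb, dim=-1)
    video_emb = F.normalize(video_emb, dim=-1)

    # Pairwise similarity matrix; the diagonal holds the positive pairs.
    logits = audio_emb @ video_emb.t() / temperature
    targets = torch.arange(audio_emb.size(0), device=audio_emb.device)

    # Contrast in both directions (audio-to-video and video-to-audio).
    loss_a2v = F.cross_entropy(logits, targets)
    loss_v2a = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_a2v + loss_v2a)
```

In a setup like this, the audio encoder's representations are pulled toward the visual context of their accompanying video, which is one plausible way the visual modality could enrich the audio features evaluated on music tagging.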