Content-based music information retrieval has seen rapid progress with the adoption of deep learning. Current approaches to high-level music description typically make use of classification models, such as in auto-tagging or genre and mood classification. In this work, we propose to address music description via audio captioning, defined as the task of generating a natural language description of music audio content in a human-like manner. To this end, we present the first music audio captioning model, MusCaps, consisting of an encoder-decoder with temporal attention. Our method combines convolutional and recurrent neural network architectures to jointly process audio-text inputs through a multimodal encoder and leverages pre-training on audio data to obtain representations that effectively capture and summarise musical features in the input. Evaluation of the generated captions through automatic metrics shows that our method outperforms a baseline designed for non-music audio captioning. Through an ablation study, we find that this performance boost is mainly attributable to pre-training of the audio encoder, while other design choices (modality fusion, decoding strategy and the use of attention) contribute only marginally. Our model represents a shift away from classification-based music description and combines tasks requiring both auditory and linguistic understanding to bridge the semantic gap in music information retrieval.
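To make the described architecture concrete, the following is a minimal sketch, not the authors' implementation, of a MusCaps-style captioning model: a CNN encoder over log-mel spectrograms that preserves the time axis, an LSTM word decoder, and additive temporal attention that fuses the attended audio context with the word embedding at each decoding step. All class names, layer sizes and parameters here (AudioEncoder, CaptionDecoder, feat_dim, etc.) are illustrative assumptions; the exact layers, fusion scheme and pre-training of the published model are detailed in the paper itself.

    # Minimal sketch of an encoder-decoder music captioner with temporal attention.
    # Assumed, illustrative PyTorch code; not the authors' implementation.
    import torch
    import torch.nn as nn

    class AudioEncoder(nn.Module):
        """CNN over log-mel spectrograms; keeps the time axis so the decoder can attend over it."""
        def __init__(self, n_mels=128, feat_dim=256):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 2)),
                nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 2)),
            )
            self.proj = nn.Linear(64 * (n_mels // 4), feat_dim)

        def forward(self, mel):                        # mel: (batch, 1, n_mels, time)
            h = self.conv(mel)                         # (batch, 64, n_mels/4, time/4)
            h = h.permute(0, 3, 1, 2).flatten(2)       # (batch, T, 64 * n_mels/4)
            return self.proj(h)                        # (batch, T, feat_dim)

    class TemporalAttention(nn.Module):
        """Additive attention: score each audio time step against the current decoder state."""
        def __init__(self, feat_dim, hidden_dim, attn_dim=128):
            super().__init__()
            self.w_feat = nn.Linear(feat_dim, attn_dim)
            self.w_state = nn.Linear(hidden_dim, attn_dim)
            self.v = nn.Linear(attn_dim, 1)

        def forward(self, feats, state):               # feats: (batch, T, feat_dim), state: (batch, hidden_dim)
            scores = self.v(torch.tanh(self.w_feat(feats) + self.w_state(state).unsqueeze(1)))
            alpha = torch.softmax(scores, dim=1)       # (batch, T, 1) attention weights over time
            return (alpha * feats).sum(dim=1), alpha   # context vector: (batch, feat_dim)

    class CaptionDecoder(nn.Module):
        """LSTM decoder that fuses the attended audio context with the word embedding at each step."""
        def __init__(self, vocab_size, feat_dim=256, hidden_dim=512, embed_dim=256):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.attn = TemporalAttention(feat_dim, hidden_dim)
            self.lstm_cell = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
            self.out = nn.Linear(hidden_dim, vocab_size)

        def forward(self, feats, captions):            # captions: (batch, L) token ids, teacher forcing
            batch, L = captions.shape
            h = feats.new_zeros(batch, self.lstm_cell.hidden_size)
            c = feats.new_zeros(batch, self.lstm_cell.hidden_size)
            logits = []
            for t in range(L):
                context, _ = self.attn(feats, h)
                x = torch.cat([self.embed(captions[:, t]), context], dim=-1)
                h, c = self.lstm_cell(x, (h, c))
                logits.append(self.out(h))
            return torch.stack(logits, dim=1)          # (batch, L, vocab_size)

    # Toy forward pass: 128 mel bins x 200 frames, 12-token caption, vocabulary of 1000 words.
    encoder, decoder = AudioEncoder(), CaptionDecoder(vocab_size=1000)
    mel = torch.randn(2, 1, 128, 200)
    tokens = torch.randint(0, 1000, (2, 12))
    logits = decoder(encoder(mel), tokens)             # (2, 12, 1000)

In such a setup, pre-training would amount to initialising AudioEncoder (or a deeper replacement) from an auto-tagging checkpoint before caption training, which is the design choice the ablation in the abstract identifies as the main source of the performance gain.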