In this paper, we consider a novel research problem: music-to-text synaesthesia. Unlike the classical music tagging problem, which classifies a music recording into pre-defined categories, music-to-text synaesthesia aims to generate descriptive text from music recordings for further understanding. Although this is a new and interesting application for the machine learning community, to the best of our knowledge, existing music-related datasets do not contain semantic descriptions of music recordings and therefore cannot serve the music-to-text synaesthesia task. In light of this, we collect a new dataset that contains 1,955 aligned pairs of classical music recordings and text descriptions. Based on this dataset, we build a computational model that generates sentences describing the content of a music recording. To tackle the highly non-discriminative nature of classical music, we design a group topology-preservation loss, which treats multiple samples as a group reference and preserves the relative topology among different samples. Extensive experimental results demonstrate, both qualitatively and quantitatively, the effectiveness of our proposed model over five heuristic or pre-trained competitive methods and their variants on our collected dataset.
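As an illustration of what preserving the relative topology among a group of samples could look like, the following is a minimal sketch and not the loss defined in the paper. It rests on several assumptions that the abstract does not state: each sample has a music embedding and a text embedding, similarity is measured with cosine similarity, each sample's relation to the rest of the group is turned into a softmax distribution, and the mismatch between the two modalities is measured with a KL divergence. The names `cosine`, `neighbor_distribution`, `group_topology_loss`, and `temperature` are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)  # small constant avoids division by zero


def neighbor_distribution(embs, i, temperature=1.0):
    """Softmax over the similarities of sample i to every other group member."""
    sims = [
        cosine(embs[i], embs[j]) / temperature
        for j in range(len(embs))
        if j != i
    ]
    top = max(sims)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]


def group_topology_loss(music_embs, text_embs, temperature=1.0):
    """Assumed form: mean KL divergence between text-side and music-side
    neighbor distributions, computed over a group of at least two samples."""
    n = len(music_embs)
    loss = 0.0
    for i in range(n):
        p = neighbor_distribution(text_embs, i, temperature)   # reference topology
        q = neighbor_distribution(music_embs, i, temperature)  # topology to align
        loss += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return loss / n


if __name__ == "__main__":
    # Toy group of three samples with made-up 2-D embeddings.
    music = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    text = [[0.0, 1.0], [1.0, 1.0], [1.0, 0.0]]
    print(group_topology_loss(music, music))  # identical topology
    print(group_topology_loss(music, text))   # mismatched topology
```

Under these assumptions, the loss vanishes when both modalities arrange the group in the same way and grows as the neighborhood structures diverge. Which modality serves as the reference is itself a design choice made here for illustration.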