Benefiting from large-scale datasets and pre-trained models, the field of generative models has recently gained significant momentum. However, most datasets for symbolic music are very small, which potentially limits the performance of data-driven multimodal models. An intuitive solution to this problem is to leverage pre-trained models from other modalities (e.g., natural language) to improve the performance of symbolic music-related multimodal tasks. In this paper, we carry out the first study of generating complete and semantically consistent symbolic music scores from text descriptions, and explore the efficacy of using publicly available checkpoints (i.e., BERT, GPT-2, and BART) for natural language processing in the task of text-to-music generation. Our experimental results show that the improvement from using pre-trained checkpoints is statistically significant in terms of BLEU score and edit distance similarity. We analyse the capabilities and limitations of our model to better understand the potential of language-music models.
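The abstract names two evaluation metrics, BLEU score and edit distance similarity, computed between generated and reference token sequences. Below is a minimal sketch of how such metrics can be computed, assuming whitespace-tokenized sequences; the tokenization, the BLEU smoothing choice, and the normalization of edit distance into a similarity are illustrative assumptions, not the paper's implementation.

```python
# Sketch of the two metrics over token sequences. The smoothing and
# normalization choices here are assumptions, not taken from the paper.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def levenshtein(a: list[str], b: list[str]) -> int:
    """Token-level edit distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def edit_distance_similarity(a: list[str], b: list[str]) -> float:
    """1 - normalized edit distance, in [0, 1]; 1.0 means identical."""
    denom = max(len(a), len(b)) or 1
    return 1.0 - levenshtein(a, b) / denom

# Hypothetical token sequences standing in for encoded music scores.
reference = "C4 E4 G4 C5 G4 E4 C4".split()
candidate = "C4 E4 G4 B4 G4 E4 C4".split()

bleu = sentence_bleu([reference], candidate,
                     smoothing_function=SmoothingFunction().method1)
eds = edit_distance_similarity(reference, candidate)
print(f"BLEU: {bleu:.3f}, edit distance similarity: {eds:.3f}")
```

Smoothing is applied here because short sequences may lack higher-order n-gram matches, which would otherwise drive BLEU to zero; whether the paper's evaluation does the same is not stated in this excerpt.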