We consider the novel task of automatically generating text descriptions of music. Compared with well-established text generation tasks such as image captioning, the scarcity of well-paired music and text datasets makes this task considerably more challenging. In this paper, we exploit crowd-sourced music comments to construct a new dataset and propose a sequence-to-sequence model to generate text descriptions of music. More concretely, we use dilated convolutional layers as the basic components of the encoder and a memory-based recurrent neural network as the decoder. To enhance the authenticity and thematicity of the generated texts, we further propose to fine-tune the model with a discriminator as well as a novel topic evaluator. To measure the quality of the generated texts, we also propose two new evaluation metrics, which align better with human judgment than traditional metrics such as BLEU. Experimental results verify that our model is capable of generating fluent and meaningful comments that capture the thematic and content information of the original music.
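To make the described encoder-decoder setup concrete, the following is a minimal sketch in PyTorch. It is not the paper's implementation: the module names, dimensions, and the plain GRU standing in for the memory-based recurrent decoder are all assumptions for illustration; only the overall shape (a dilated-convolution encoder over music features feeding a recurrent comment decoder) follows the abstract.

```python
# A minimal sketch, assuming PyTorch and made-up hyperparameters; the paper's
# actual encoder/decoder (dilated convolutions + memory-based RNN) is only
# approximated here with standard modules.
import torch
import torch.nn as nn

class DilatedConvEncoder(nn.Module):
    """Stacks 1-D dilated convolutions over an input music feature sequence."""
    def __init__(self, in_dim=128, hid_dim=256, n_layers=4):
        super().__init__()
        layers = []
        dim = in_dim
        for i in range(n_layers):
            dilation = 2 ** i  # exponentially growing receptive field
            layers += [
                nn.Conv1d(dim, hid_dim, kernel_size=3,
                          padding=dilation, dilation=dilation),
                nn.ReLU(),
            ]
            dim = hid_dim
        self.net = nn.Sequential(*layers)

    def forward(self, x):                     # x: (batch, time, in_dim)
        h = self.net(x.transpose(1, 2))       # convolve over the time axis
        return h.transpose(1, 2)              # (batch, time, hid_dim)

class CommentDecoder(nn.Module):
    """Plain GRU decoder standing in for the memory-based recurrent decoder."""
    def __init__(self, vocab_size=10000, hid_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hid_dim)
        self.rnn = nn.GRU(hid_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, tokens, enc_out):
        # Use the mean of the encoder states as the initial decoder state.
        h0 = enc_out.mean(dim=1, keepdim=True).transpose(0, 1).contiguous()
        dec, _ = self.rnn(self.embed(tokens), h0)
        return self.out(dec)                  # (batch, len, vocab_size)

# Toy usage: 8 clips, 100 frames of 128-dim audio features, 20-token comments.
enc = DilatedConvEncoder()
dec = CommentDecoder()
feats = torch.randn(8, 100, 128)
tokens = torch.randint(0, 10000, (8, 20))
logits = dec(tokens, enc(feats))              # (8, 20, 10000)
```

In this sketch the decoder is trained with ordinary cross-entropy over the comment tokens; the adversarial fine-tuning with a discriminator and topic evaluator mentioned in the abstract would be added on top of such a base model.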