Symbolic music generation relies on the contextual representation capabilities of the generative model, where the most prevalent approach is the Transformer-based model. Moreover, learning long-term context is closely tied to the dynamic segmentation of musical structure into sections such as intro, verse, and chorus, which the research community has so far overlooked. In this paper, we propose a multi-scale Transformer that uses a coarse decoder and fine decoders to model context at the global and section levels, respectively. Concretely, we design a Fragment Scope Localization layer that segments the music into sections, which are then used to pre-train the fine decoders. We further design a Music Style Normalization layer that transfers style information from the original sections to the generated sections, keeping the musical style consistent. The generated sections are combined in the aggregation layer and refined by the coarse decoder. We evaluate our model on two open MIDI datasets, and experiments show that it outperforms the best contemporary symbolic music generative models. More excitingly, visual evaluation shows that our model is superior in melody reuse, resulting in more realistic music.
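To make the coarse/fine decomposition more concrete, the following is a minimal PyTorch sketch of the idea described above, not the authors' implementation. The names StyleNorm, MultiScaleDecoder, and section_bounds are hypothetical stand-ins for the paper's Music Style Normalization layer, coarse/fine decoders, and Fragment Scope Localization output; plain self-attention blocks are used in place of the actual decoder architecture, and the AdaIN-style mean/std transfer is only one plausible reading of how style normalization could work.

```python
# Minimal sketch (assumptions noted above), not the paper's implementation.
import torch
import torch.nn as nn


class StyleNorm(nn.Module):
    """Hypothetical style normalization: re-scale generated-section features
    with the mean/std of the original section to keep the style consistent."""
    def __init__(self, eps: float = 1e-5):
        super().__init__()
        self.eps = eps

    def forward(self, generated: torch.Tensor, original: torch.Tensor) -> torch.Tensor:
        g_mean, g_std = generated.mean(dim=1, keepdim=True), generated.std(dim=1, keepdim=True)
        o_mean, o_std = original.mean(dim=1, keepdim=True), original.std(dim=1, keepdim=True)
        return (generated - g_mean) / (g_std + self.eps) * o_std + o_mean


class MultiScaleDecoder(nn.Module):
    """Fine decoder models each section; coarse decoder refines the
    aggregated sequence to capture global context."""
    def __init__(self, d_model: int = 256, n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.fine_decoder = nn.TransformerEncoder(layer, n_layers)    # section level
        self.coarse_decoder = nn.TransformerEncoder(layer, n_layers)  # global level
        self.style_norm = StyleNorm()

    def forward(self, x: torch.Tensor, section_bounds: list[tuple[int, int]]) -> torch.Tensor:
        # section_bounds plays the role of the Fragment Scope Localization output:
        # (start, end) indices of intro/verse/chorus-like segments.
        refined_sections = []
        for start, end in section_bounds:
            section = x[:, start:end]
            generated = self.fine_decoder(section)               # section-level modelling
            refined_sections.append(self.style_norm(generated, section))
        aggregated = torch.cat(refined_sections, dim=1)          # aggregation layer
        return self.coarse_decoder(aggregated)                   # global refinement


# Usage on dummy bar-level embeddings (batch=2, 64 steps, 256 dims),
# split into three hypothetical sections.
model = MultiScaleDecoder()
x = torch.randn(2, 64, 256)
out = model(x, section_bounds=[(0, 16), (16, 48), (48, 64)])
print(out.shape)  # torch.Size([2, 64, 256])
```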