In this paper, we propose SinTra, an auto-regressive sequential generative model that learns from a single multi-track music segment to generate coherent, aesthetic, and variable polyphonic music for multiple instruments, of arbitrary length in bars. To keep the generated samples relevant to the training music, we present a novel pitch-group representation. SinTra consists of a pyramid of Transformer-XL models trained with a multi-scale strategy, which lets it learn both the musical structure and the relative positional relationships between notes in the single training segment. Additionally, to maintain inter-track correlation, we process the multi-track music with convolution operations; during decoding, the tracks are independent of each other to prevent interference. We evaluate SinTra with both a subjective study and objective metrics. The comparison results show that our framework learns information from a single music segment more thoroughly than Music Transformer. In addition, a comparison between SinTra and its variant, the single-stage SinTra with the first stage only, shows that the pyramid structure effectively suppresses overly fragmented notes.