Existing approaches for generating multitrack music with transformer models have been limited to either a small set of instruments or short music segments. This is partly due to the memory requirements of the lengthy input sequences necessitated by existing representations for multitrack music. In this work, we propose a compact representation that allows a diverse set of instruments while keeping a short sequence length. Using our proposed representation, we present the Multitrack Music Transformer (MTMT) for learning long-term dependencies in multitrack music. In a subjective listening test, our proposed model achieves competitive quality on unconditioned generation against two baseline models. We also show that our proposed model can generate samples that are twice as long as those produced by the baseline models and, further, can do so in half the inference time. Moreover, we propose a new measure for analyzing musical self-attention and show that the trained model learns to pay less attention to notes that form a dissonant interval with the current note, while attending more to notes that are 4N beats away from the current note. Finally, our findings provide a novel foundation for future work exploring longer-form multitrack music generation and improving self-attention for music. All source code and audio samples can be found at https://salu133445.github.io/mtmt/ .