This paper proposes a new self-attention-based model for music score infilling, i.e., generating a polyphonic music sequence that fills the gap between given past and future contexts. While existing approaches can only fill in a short segment with a fixed number of notes, or a fixed time span between the past and future contexts, our model can infill a variable number of notes (up to 128) for different time spans. We achieve this through three major technical contributions. First, we adapt XLNet, an autoregressive model originally proposed for unsupervised model pre-training, to music score infilling. Second, we propose a new, musically specialized positional encoding called relative bar encoding that better informs the model of each note's position within the past and future contexts. Third, to capitalize on relative bar encoding, we perform look-ahead onset prediction, which predicts the onset of a note one time step before predicting the note's other attributes. We compare our proposed model with two strong baselines and show that it is superior in both objective and subjective analyses.
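To make the idea of a bar-level position signal concrete, the following is a minimal sketch and not the encoding defined in the paper. It assumes one simple scheme: each note is tagged with its bar index counted from the first bar of the gap, so that past-context notes receive negative indices and notes in or after the gap receive non-negative ones. The function name `relative_bar_indices`, the parameter `gap_start_bar`, and the fixed `beats_per_bar` are all hypothetical choices made for illustration.

```python
from typing import List


def relative_bar_indices(
    onsets_in_beats: List[float],
    gap_start_bar: int,
    beats_per_bar: int = 4,
) -> List[int]:
    """Return each note's bar index relative to the first bar of the gap.

    Assumption (not from the paper): bars have a fixed length of
    `beats_per_bar` beats and onsets are measured in beats from the
    start of the excerpt.
    """
    # Absolute bar index is the floor of onset / bar length;
    # subtracting the gap's first bar makes it relative.
    return [int(onset // beats_per_bar) - gap_start_bar for onset in onsets_in_beats]


# Five notes in 4/4; the gap to be infilled starts at bar 1.
onsets = [0.0, 3.5, 4.0, 9.0, 12.0]
print(relative_bar_indices(onsets, gap_start_bar=1))  # [-1, -1, 0, 1, 2]
```

In a model, such integer indices would typically be mapped to learned embeddings and combined with the token embeddings; how the proposed model does this is not specified in the abstract.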