Generating music with deep neural networks has been an area of active research in recent years. While the quality of generated samples has been steadily increasing, most methods offer little or no control over the generated sequence. We propose the self-supervised \emph{description-to-sequence} task, which allows for fine-grained controllable generation on a global level. The task extracts high-level features about the target sequence and learns the conditional distribution of sequences given the corresponding high-level description in a sequence-to-sequence modelling setup. We train FIGARO (FIne-grained music Generation via Attention-based, RObust control) by applying \emph{description-to-sequence} modelling to symbolic music. By combining learned high-level features with domain knowledge, which acts as a strong inductive bias, the model achieves state-of-the-art results in controllable symbolic music generation and generalizes well beyond the training distribution.
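The self-supervised setup described above can be illustrated with a minimal sketch: a description is computed from the target sequence itself, so every unlabeled sequence yields a (description, sequence) training pair. Everything below is an assumption made for illustration, not the paper's implementation. The note and bar representation, the `BarDescription` fields, and the helper names `describe_bar` and `make_training_pair` are hypothetical, and the hand-picked statistics only stand in for the domain-knowledge features, not the learned ones.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Tuple

# Assumed toy representation: a note is (MIDI pitch, duration in beats),
# and a bar is a list of notes. This is not the paper's token format.
Note = Tuple[int, float]
Bar = List[Note]


@dataclass(frozen=True)
class BarDescription:
    """Hypothetical high-level description of a single bar."""
    note_count: int
    mean_pitch: float
    mean_duration: float


def describe_bar(bar: Bar) -> BarDescription:
    """Extract simple summary statistics from one bar."""
    if not bar:
        # An empty bar has no pitches or durations to average.
        return BarDescription(0, 0.0, 0.0)
    return BarDescription(
        note_count=len(bar),
        mean_pitch=mean(pitch for pitch, _ in bar),
        mean_duration=mean(duration for _, duration in bar),
    )


def make_training_pair(
    sequence: List[Bar],
) -> Tuple[List[BarDescription], List[Bar]]:
    """Build a (description, target) pair from an unlabeled sequence.

    The description is derived from the target itself, which is what
    makes the task self-supervised: no human annotation is required.
    """
    description = [describe_bar(bar) for bar in sequence]
    return description, sequence
```

In a sequence-to-sequence setup, the description would be fed to the encoder and the original sequence used as the decoder target. At generation time, a user could then edit the description to steer the output.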