Symbolic music generation aims to automatically generate music scores. A recent trend is to use Transformers or their variants for music generation, which is, however, suboptimal: full attention cannot efficiently model the typically long music sequences (e.g., over 10,000 tokens), and existing models fall short in generating musical repetition structures. In this paper, we propose Museformer, a Transformer with a novel fine- and coarse-grained attention for music generation. Specifically, with the fine-grained attention, a token of a specific bar directly attends to all the tokens of the bars that are most relevant to music structures (e.g., the previous 1st, 2nd, 4th, and 8th bars, selected via similarity statistics); with the coarse-grained attention, a token attends only to a summarization of the other bars rather than to each of their tokens, so as to reduce the computational cost. The advantages are two-fold. First, it captures music structure-related correlations via the fine-grained attention and other contextual information via the coarse-grained attention. Second, it is efficient and can model music sequences over 3x longer than its full-attention counterpart can. Both objective and subjective experimental results demonstrate its ability to generate long music sequences with high quality and better structure.
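To make the attention pattern concrete, below is a minimal sketch (not the authors' implementation) of how such a fine-/coarse-grained attention mask could be constructed. The function name `fine_coarse_mask`, the per-token `bar_ids`, the per-bar summary-token positions in `summary_pos`, and the placement of summary tokens inside the sequence are all illustrative assumptions; only the bar offsets (1, 2, 4, 8) come from the abstract.

```python
import torch

def fine_coarse_mask(bar_ids, summary_pos, structure_offsets=(1, 2, 4, 8)):
    """Sketch of a Museformer-style attention mask (True = may attend).

    bar_ids:           (seq_len,) long tensor; bar index of each token.
                       Summary tokens are assumed to carry their bar's index.
    summary_pos:       dict mapping bar index -> position of that bar's
                       summary token (assumed to sit at the end of the bar).
    structure_offsets: bar offsets treated as structure-related.
    """
    seq_len = bar_ids.size(0)
    mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    pos_to_bar = {pos: bar for bar, pos in summary_pos.items()}
    positions = torch.arange(seq_len)

    for i in range(seq_len):
        b = int(bar_ids[i])
        if i in pos_to_bar:
            # Summary token: aggregates all preceding tokens of its own bar.
            mask[i] = (bar_ids == pos_to_bar[i]) & (positions <= i)
            continue
        # Fine-grained: the current bar plus structure-related previous bars.
        fine_bars = {b} | {b - d for d in structure_offsets if b - d >= 0}
        for j in range(i + 1):                      # causal: only look back
            if int(bar_ids[j]) in fine_bars:
                mask[i, j] = True
        # Coarse-grained: every other earlier bar is visible only through
        # its single summary token, not through its individual tokens.
        for bar, pos in summary_pos.items():
            if bar < b and bar not in fine_bars and pos <= i:
                mask[i, pos] = True
    return mask
```

A mask built this way could then be handed to a standard attention implementation; note that masking conventions differ between libraries (some expect True to mean "may attend", others the opposite), so it may need to be inverted before use.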