Learning symbolic music representations, especially disentangled representations with probabilistic interpretations, has been shown to benefit both music understanding and generation. However, most existing models apply only to short-term music, and learning long-term music representations remains challenging. Several studies have attempted to learn hierarchical representations directly in an end-to-end manner, but these models fail to achieve the desired results and their training is unstable. In this paper, we propose a novel approach to learning long-term symbolic music representations through contextual constraints. First, we use contrastive learning to pre-train a long-term representation by constraining its difference from the short-term representation (extracted by an off-the-shelf model). Then, we fine-tune the long-term representation with a hierarchical prediction model, such that a good long-term representation (e.g., an 8-bar representation) can reconstruct the corresponding short-term ones (e.g., the 2-bar representations within the 8-bar range). Experiments show that our method stabilizes both the pre-training and fine-tuning steps. Moreover, the designed contextual constraints benefit both reconstruction and disentanglement, significantly outperforming the baselines.
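The contrastive pre-training step described above can be illustrated with a minimal sketch. The abstract does not specify the exact loss, so the InfoNCE-style formulation below, the function name `contextual_contrastive_loss`, the mean-pooling of short-term representations, and all tensor shapes are illustrative assumptions: each long-term (e.g., 8-bar) representation is pulled toward the aggregate of its own short-term (e.g., 2-bar) representations and pushed away from those of other segments.

```python
import torch
import torch.nn.functional as F

def contextual_contrastive_loss(long_repr, short_reprs, temperature=0.1):
    """Hypothetical InfoNCE-style contextual constraint (not the paper's
    exact loss): align each long-term representation with the mean of its
    own short-term representations, contrasting against other segments.

    long_repr:   (B, D) long-term representations, one per segment
    short_reprs: (B, K, D) the K short-term representations per segment
    """
    context = short_reprs.mean(dim=1)          # (B, D) aggregated short-term context
    z = F.normalize(long_repr, dim=-1)
    c = F.normalize(context, dim=-1)
    logits = z @ c.t() / temperature           # (B, B) cosine-similarity matrix
    targets = torch.arange(z.size(0))          # positives lie on the diagonal
    return F.cross_entropy(logits, targets)
```

Under this sketch, minimizing the loss encourages the long-term representation to stay predictably close to its short-term context, which is the kind of constraint the fine-tuning stage can then exploit for hierarchical reconstruction.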