Music creation typically involves two parts: composing the musical score, and then performing the score with instruments to produce sound. While recent work has made considerable progress in automatic music generation in the symbolic domain, few attempts have been made to build an AI model that can render realistic music audio from musical scores. Directly synthesizing audio with sound sample libraries often leads to mechanical and deadpan results, since musical scores do not contain performance-level information, such as subtle changes in timing and dynamics. Moreover, while the task may sound like a text-to-speech synthesis problem, there are fundamental differences, since music audio has rich polyphonic sounds. To build such an AI performer, we propose in this paper a deep convolutional model that learns, in an end-to-end manner, the score-to-audio mapping between a symbolic representation of music called the piano roll and an audio representation of music called the spectrogram. The model consists of two subnets: the ContourNet, which uses a U-Net structure to learn the correspondence between piano rolls and spectrograms and to produce an initial result; and the TextureNet, which uses a multi-band residual network to refine that result by adding the spectral texture of overtones and timbre. We train the model to generate music clips of the violin, cello, and flute, using a dataset of moderate size. We also present the results of a user study showing that our model achieves a higher mean opinion score (MOS) in naturalness and emotional expressivity than a WaveNet-based model and two commercial sound libraries. Our source code is available at https://github.com/bwang514/PerformanceNet.
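To make the input side of the score-to-audio mapping concrete, the following is a minimal sketch of how a piano roll can be built from a list of notes. It is an illustration only, not the authors' preprocessing code: the note format `(midi_pitch, onset_seconds, offset_seconds)`, the function name `notes_to_piano_roll`, and the frame rate are all assumptions introduced here.

```python
# Illustrative sketch only; not taken from the PerformanceNet code base.
# Assumed note format: (midi_pitch, onset_seconds, offset_seconds).
# The frame rate and pitch range are hypothetical, not the paper's settings.

N_PITCHES = 128  # full MIDI pitch range


def notes_to_piano_roll(notes, frames_per_second=10, n_pitches=N_PITCHES):
    """Return a binary piano roll as a list of rows, one row per pitch.

    roll[p][t] is 1 if pitch p sounds during frame t, and 0 otherwise.
    """
    if not notes:
        return [[] for _ in range(n_pitches)]

    # The roll is as long as the latest note offset, measured in frames.
    end_time = max(offset for _, _, offset in notes)
    n_frames = int(round(end_time * frames_per_second))

    roll = [[0] * n_frames for _ in range(n_pitches)]
    for pitch, onset, offset in notes:
        start = int(round(onset * frames_per_second))
        stop = int(round(offset * frames_per_second))
        # Mark every frame in which this note is active.
        for t in range(start, min(stop, n_frames)):
            roll[pitch][t] = 1
    return roll


# Three overlapping notes (C4, E4, G4) at 4 frames per second.
example_notes = [(60, 0.0, 1.0), (64, 0.5, 1.5), (67, 1.0, 2.0)]
roll = notes_to_piano_roll(example_notes, frames_per_second=4)
```

In this example, frames 2 and 3 contain both pitch 60 and pitch 64, which shows how a single piano-roll column can encode several simultaneous notes. The roll records which pitches sound and when, but nothing about timbre or fine-grained dynamics, which is the performance-level information the model is meant to supply.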