Modern approaches to sound synthesis using deep neural networks are hard to control, especially when fine-grained conditioning information is not available, hindering their adoption by musicians. In this paper, we cast the generation of individual instrumental notes as an inpainting-based task, introducing novel and unique ways to iteratively shape sounds. To this end, we propose a two-step approach: first, we adapt the VQ-VAE-2 image generation architecture to spectrograms in order to convert real-valued spectrograms into compact discrete codemaps; we then implement token-masked Transformers for the inpainting-based generation of these codemaps. We apply the proposed architecture to the NSynth dataset on masked resampling tasks. Most crucially, we open-source an interactive web interface for transforming sounds by inpainting, for artists and practitioners alike, opening up new creative uses.
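To make the two-step pipeline concrete, below is a minimal PyTorch sketch, not the authors' implementation: a toy VQ-VAE-style encoder quantizes a spectrogram into a discrete codemap, and a token-masked Transformer resamples a masked region of that codemap. All names (`SpectrogramVQVAE`, `MaskedCodemapTransformer`, `MASK_TOKEN`), layer sizes, and codebook dimensions are illustrative assumptions, not values from the paper.

```python
# Illustrative sketch of spectrogram inpainting via discrete codemaps.
# Architecture sizes and names are hypothetical, not from the paper.
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Nearest-neighbour vector quantization, as in VQ-VAE."""
    def __init__(self, num_codes=512, dim=64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):                        # z: (B, dim, H, W)
        flat = z.permute(0, 2, 3, 1).reshape(-1, z.shape[1])
        dists = torch.cdist(flat, self.codebook.weight)  # (B*H*W, num_codes)
        codes = dists.argmin(dim=1)                      # discrete code indices
        return codes.view(z.shape[0], z.shape[2], z.shape[3])

class SpectrogramVQVAE(nn.Module):
    """Step 1: compress a real-valued spectrogram into a discrete codemap."""
    def __init__(self, dim=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, dim, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 4, stride=2, padding=1),
        )
        self.quantizer = VectorQuantizer(dim=dim)

    def encode(self, spec):                      # spec: (B, 1, freq, time)
        return self.quantizer(self.encoder(spec))

NUM_CODES, MASK_TOKEN = 512, 512                 # one extra id for the mask

class MaskedCodemapTransformer(nn.Module):
    """Step 2: predict codebook tokens at masked codemap positions."""
    def __init__(self, d_model=128, seq_len=256):
        super().__init__()
        self.embed = nn.Embedding(NUM_CODES + 1, d_model)
        self.pos = nn.Parameter(torch.zeros(1, seq_len, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(d_model, NUM_CODES)

    def forward(self, tokens):                   # tokens: (B, seq_len)
        h = self.encoder(self.embed(tokens) + self.pos)
        return self.head(h)                      # logits over the codebook

# Masked resampling: mask a user-selected region of the codemap and
# resample only those tokens from the Transformer's predictions.
vqvae, transformer = SpectrogramVQVAE(), MaskedCodemapTransformer()
spec = torch.randn(1, 1, 64, 64)                 # toy spectrogram
codemap = vqvae.encode(spec)                     # (1, 16, 16) discrete codes
tokens = codemap.flatten(1)                      # (1, 256) token sequence
region = torch.zeros_like(tokens, dtype=torch.bool)
region[:, 64:128] = True                         # region chosen by the user
tokens[region] = MASK_TOKEN
logits = transformer(tokens)
resampled = torch.distributions.Categorical(logits=logits).sample()
tokens[region] = resampled[region]               # only masked cells change
```

In this sketch, resampling only the masked tokens is what makes the generation interactive: the rest of the codemap, and hence the untouched part of the sound, is preserved across iterations.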