Deep learning models are mostly used for offline inference. However, this strongly limits their use in audio generation setups, as most creative workflows are based on real-time digital signal processing. Although approaches based on recurrent networks adapt naturally to this buffer-based computation, the use of convolutions still poses serious challenges. To tackle this issue, the use of causal streaming convolutions has been proposed. However, this requires a specific, more complex training procedure and can degrade the resulting audio quality. In this paper, we introduce a new method for producing non-causal streaming models, which makes any convolutional model compatible with real-time buffer-based processing. As our method is based on a post-training reconfiguration of the model, we show that it can transform models trained without causal constraints into streaming models. We also show how our method can be adapted to complex architectures with parallel branches. To evaluate our method, we apply it to the recent RAVE model, which provides high-quality real-time audio synthesis. We test our approach on multiple music and speech datasets and show that it is faster than overlap-add methods while having no impact on generation quality. Finally, we introduce two open-source implementations of our work, as Max/MSP and PureData externals and as a VST audio plugin. This endows traditional digital audio workstations with real-time neural audio synthesis on a laptop CPU.
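To give a concrete intuition for the streaming reconfiguration the abstract describes, the following is a minimal sketch (not the authors' implementation, and omitting the latency handling needed for non-causal padding): a Conv1d whose zero-padding is replaced by a persistent cache of past samples, so that successive audio buffers can be processed one at a time while matching the output of a single offline pass. The class name StreamingConv1d is ours, chosen for illustration.

```python
# Illustrative sketch of cached padding, assuming PyTorch.
import torch
import torch.nn as nn

class StreamingConv1d(nn.Module):
    """Conv1d whose left zero-padding is replaced by an internal cache,
    making buffer-based processing equivalent to one offline call."""

    def __init__(self, in_channels, out_channels, kernel_size):
        super().__init__()
        # No padding here: past context is supplied by the cache instead.
        self.conv = nn.Conv1d(in_channels, out_channels, kernel_size)
        # Cache holds the last (kernel_size - 1) input samples seen so far;
        # its initial zeros play the role of the usual zero-padding.
        self.register_buffer(
            "cache", torch.zeros(1, in_channels, kernel_size - 1)
        )

    def forward(self, x):
        # Prepend the cached context to the incoming buffer, then store
        # the newest samples as context for the next call.
        x = torch.cat([self.cache, x], dim=-1)
        self.cache = x[..., -(self.conv.kernel_size[0] - 1):].detach()
        return self.conv(x)

# Processing a signal in buffers gives the same result as one offline
# pass over the padded signal, with no overlap-add reconstruction.
conv = StreamingConv1d(1, 1, kernel_size=3)
signal = torch.randn(1, 1, 8)
streamed = torch.cat(
    [conv(chunk) for chunk in signal.split(4, dim=-1)], dim=-1
)
```

This sketch only covers the causal case; the method described in the paper additionally converts the right-hand (non-causal) part of the padding into an internal delay, which is what allows models trained without causal constraints to be reconfigured after training.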