Deep generative models applied to audio have improved the state of the art in many speech- and music-related tasks by a large margin. However, as raw waveform modelling remains an inherently difficult task, audio generative models are either computationally intensive, rely on low sampling rates, are complicated to control, or restrict the nature of the possible signals. Among those models, Variational AutoEncoders (VAEs) give control over the generation by exposing latent variables, although they usually suffer from low synthesis quality. In this paper, we introduce a Realtime Audio Variational autoEncoder (RAVE) allowing both fast and high-quality audio waveform synthesis. We introduce a novel two-stage training procedure, namely representation learning and adversarial fine-tuning. We show that a post-training analysis of the latent space allows direct control over the trade-off between reconstruction fidelity and representation compactness. By leveraging a multi-band decomposition of the raw waveform, we show that our model is the first able to generate 48 kHz audio signals while simultaneously running 20 times faster than real-time on a standard laptop CPU. We evaluate synthesis quality using both quantitative and qualitative subjective experiments and show the superiority of our approach compared to existing models. Finally, we present applications of our model for timbre transfer and signal compression. All of our source code and audio examples are publicly available.
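As a back-of-the-envelope illustration of why the multi-band decomposition mentioned above enables fast 48 kHz synthesis (the band count below is a hypothetical choice for illustration, not a value taken from this abstract):

```python
# Sketch of the temporal-reduction argument behind multi-band waveform
# modelling. All numbers besides the 48 kHz target are illustrative
# assumptions, not values from the paper.

sample_rate = 48_000   # target output rate claimed in the abstract (Hz)
num_bands = 16         # hypothetical number of critically sampled sub-bands

# With critical decimation, each sub-band signal runs at only
# sample_rate / num_bands samples per second, and all bands can be
# produced in parallel by a single network pass.
band_rate = sample_rate // num_bands

# The temporal length the synthesis network must cover for one second
# of audio therefore shrinks by the same factor.
reduction = sample_rate // band_rate

print(f"per-band rate: {band_rate} Hz, temporal reduction: {reduction}x")
```

Under these assumed numbers, the network only ever operates on 3 kHz sub-band sequences, which is the kind of reduction that makes faster-than-real-time CPU synthesis plausible.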