Chunked Autoregressive GAN for Conditional Waveform Synthesis

Conditional waveform synthesis models learn a distribution of audio waveforms given conditioning such as text, mel-spectrograms, or MIDI. These systems employ deep generative models that model the waveform via either sequential (autoregressive) or parallel (non-autoregressive) sampling. Generative adversarial networks (GANs) have become a common choice for non-autoregressive waveform synthesis. However, state-of-the-art GAN-based models produce artifacts when performing mel-spectrogram inversion. In this paper, we demonstrate that these artifacts correspond to the generator's inability to learn accurate pitch and periodicity. We show that simple pitch and periodicity conditioning is insufficient for reducing this error relative to using autoregression. We discuss the inductive bias that autoregression provides for learning the relationship between instantaneous frequency and phase, and show that this inductive bias holds even when autoregressively sampling large chunks of the waveform during each forward pass. Relative to prior state-of-the-art GAN-based models, our proposed model, Chunked Autoregressive GAN (CARGAN), reduces pitch error by 40-60%, reduces training time by 58%, maintains a fast generation speed suitable for real-time or interactive applications, and maintains or improves subjective quality.
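The idea of sampling large chunks autoregressively can be illustrated with a minimal sketch. This is not the CARGAN implementation: `chunked_autoregressive_synthesis`, `toy_generator`, and all parameters are hypothetical names chosen for illustration. A hand-written sinusoid continuation stands in for the learned generator, and a per-chunk frequency in Hz stands in for mel-spectrogram conditioning. The sketch only shows the control flow: each forward pass emits a whole chunk, and the tail of the previously generated audio is fed back so the next chunk can continue with a consistent phase.

```python
import math


def toy_generator(freq_hz, context, chunk_size, sample_rate):
    """Hypothetical stand-in for a learned generator.

    Continues a unit sinusoid from the last two context samples using the
    recurrence s[n] = 2*cos(w)*s[n-1] - s[n-2], so the phase of the new
    chunk is inferred from previously generated audio.
    """
    w = 2.0 * math.pi * freq_hz / sample_rate
    prev2, prev1 = context[-2], context[-1]
    if prev1 == 0.0 and prev2 == 0.0:
        # No history yet (zero padding): start a new sinusoid at phase zero.
        prev2, prev1 = -math.sin(w), 0.0
    chunk = []
    for _ in range(chunk_size):
        current = 2.0 * math.cos(w) * prev1 - prev2
        chunk.append(current)
        prev2, prev1 = prev1, current
    return chunk


def chunked_autoregressive_synthesis(generator, conditioning, chunk_size,
                                     context_size, sample_rate):
    """Generate one chunk per conditioning entry, feeding back prior samples."""
    # Zero-padded history so the first chunk also receives a context.
    waveform = [0.0] * context_size
    for cond in conditioning:
        context = waveform[-context_size:]
        waveform.extend(generator(cond, context, chunk_size, sample_rate))
    # Drop the padding before returning the generated audio.
    return waveform[context_size:]


if __name__ == "__main__":
    audio = chunked_autoregressive_synthesis(
        toy_generator, [440.0] * 4, chunk_size=256, context_size=2,
        sample_rate=16000)
    print(len(audio))
```

In this toy setting, passing an all-zero context to every call would restart each chunk at phase zero and introduce a discontinuity at every chunk boundary, which is a simplified picture of the phase errors a non-autoregressive generator has to avoid without access to its own past output.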
