Recent works have shown the capability of deep generative models to tackle general audio synthesis from a single label, producing a variety of impulsive, tonal, and environmental sounds. Such models operate on band-limited signals and, as a result of their autoregressive approach, are typically composed of pre-trained latent encoders and/or several cascaded modules. In this work, we propose a diffusion-based generative model for general audio synthesis, named DAG, which deals with full-band signals end-to-end in the waveform domain. Results show the superiority of DAG over existing label-conditioned generators in terms of both quality and diversity. More specifically, when compared to the state of the art, the band-limited and full-band versions of DAG achieve relative improvements of up to 40% and 65%, respectively. We believe DAG is flexible enough to accommodate different conditioning schemas while providing good-quality synthesis.
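The core idea stated above, label-conditioned diffusion sampling directly in the waveform domain, can be illustrated with a toy annealed-Langevin sampler. This is a minimal sketch, not DAG's actual architecture: `toy_score_model` is a hypothetical closed-form stand-in for the learned score network, and the class label is assumed to be represented by a fixed embedding vector of the same length as the waveform.

```python
import numpy as np

def toy_score_model(x, sigma, label_emb):
    # Hypothetical stand-in for a learned score network: the real model is a
    # deep network conditioned on the class label; here we use the analytic
    # score of a Gaussian centered at the label embedding.
    return (label_emb - x) / sigma ** 2

def sample_waveform(label_emb, num_steps=60, sigma_max=1.0, sigma_min=0.01, seed=0):
    """Annealed Langevin-style reverse diffusion over a raw waveform array."""
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, sigma_max, size=label_emb.shape)  # start from pure noise
    for sigma in np.geomspace(sigma_max, sigma_min, num_steps):
        step = 0.5 * sigma ** 2                      # step size shrinks with noise level
        z = rng.normal(size=x.shape)
        x = x + step * toy_score_model(x, sigma, label_emb) + np.sqrt(step) * z
    return x
```

Because the toy score pulls samples toward `label_emb` while the injected noise is annealed toward `sigma_min`, the returned waveform ends up close to the conditioning target; in the real model the same loop instead converges to a sample from the learned audio distribution for that label.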