Text-to-audio (TTA) generation can significantly benefit the media industry by reducing production costs and enhancing work efficiency. However, most current TTA models (primarily diffusion-based) suffer from slow inference speeds and high computational costs. In this paper, we introduce AudioGAN, the first successful Generative Adversarial Network (GAN)-based TTA framework that generates audio in a single forward pass, thereby reducing model complexity and inference time. To overcome the inherent difficulties of training GANs, we integrate multiple contrastive losses and propose two innovative components: Single-Double-Triple (SDT) Attention and Time-Frequency Cross-Attention (TF-CA). Extensive experiments on the AudioCaps dataset demonstrate that AudioGAN achieves state-of-the-art performance while using 90% fewer parameters and running 20 times faster, synthesizing audio in under one second. These results establish AudioGAN as a practical and powerful solution for real-time TTA.
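For intuition, the sketch below shows one generic way a time-frequency cross-attention block could be structured for spectrogram-like features, where time-axis tokens attend to frequency-axis tokens. The module name, pooling scheme, and tensor shapes are assumptions made purely for illustration; this is not the paper's actual TF-CA implementation.

```python
# Illustrative sketch only: a generic time-frequency cross-attention block.
# All design choices here (pooling, residual path, shapes) are assumptions,
# not the authors' TF-CA module.
import torch
import torch.nn as nn


class TimeFreqCrossAttention(nn.Module):
    """Time-axis tokens attend to frequency-axis tokens of a spectrogram feature map."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, freq, time) spectrogram-like features
        b, c, f, t = x.shape
        # Time tokens: average over frequency -> (batch, time, channels)
        time_tokens = x.mean(dim=2).transpose(1, 2)
        # Frequency tokens: average over time -> (batch, freq, channels)
        freq_tokens = x.mean(dim=3).transpose(1, 2)
        # Cross-attention: time tokens query the frequency tokens
        attended, _ = self.attn(query=time_tokens, key=freq_tokens, value=freq_tokens)
        attended = self.norm(attended + time_tokens)  # residual + layer norm
        # Broadcast the attended time tokens back onto the feature map
        return x + attended.transpose(1, 2).unsqueeze(2)


if __name__ == "__main__":
    feats = torch.randn(2, 64, 80, 128)  # (batch, channels, mel bins, frames)
    block = TimeFreqCrossAttention(dim=64)
    print(block(feats).shape)            # torch.Size([2, 64, 80, 128])
```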