Deep neural networks have recently achieved breakthroughs in sound generation with text prompts. Despite their promising performance, current text-to-sound generation models face issues such as overfitting on small-scale datasets, which significantly limits their performance. In this paper, we investigate the use of pre-trained AudioLDM, the state-of-the-art model for text-to-audio generation, as the backbone for sound generation. Our study demonstrates the advantages of using pre-trained models for text-to-sound generation, especially in data-scarcity scenarios. In addition, our experiments show that different training strategies (e.g., training conditions) can affect the performance of AudioLDM on datasets of different scales. To facilitate future studies, we also evaluate various text-to-sound generation systems on several frequently used datasets under the same evaluation protocols, allowing fair comparison and benchmarking of these methods on common ground.