We tackle the problem of generating audio samples conditioned on descriptive text captions. In this work, we propose AudioGen, an auto-regressive generative model that generates audio samples conditioned on text inputs. AudioGen operates on a learnt discrete audio representation. The task of text-to-audio generation poses multiple challenges. Due to the way audio travels through a medium, differentiating ``objects'' can be a difficult task (e.g., separating multiple people speaking simultaneously). This is further complicated by real-world recording conditions (e.g., background noise, reverberation, etc.). Scarce text annotations impose another constraint, limiting the ability to scale models. Finally, modeling high-fidelity audio requires encoding audio at a high sampling rate, leading to extremely long sequences. To alleviate the aforementioned challenges, we propose an augmentation technique that mixes different audio samples, driving the model to internally learn to separate multiple sources. We curated 10 datasets containing different types of audio and text annotations to handle the scarcity of text-audio data points. For faster inference, we explore the use of multi-stream modeling, allowing the use of shorter sequences while maintaining a similar bitrate and perceptual quality. We apply classifier-free guidance to improve adherence to text. Compared to the evaluated baselines, AudioGen outperforms them on both objective and subjective metrics. Finally, we explore the ability of the proposed method to generate audio continuations conditionally and unconditionally. Samples: https://felixkreuk.github.io/audiogen
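To make the mixing augmentation concrete, the minimal sketch below blends two waveforms at a random signal-to-noise ratio and joins their captions so the text condition describes both sources. The SNR range, the caption-joining scheme, and the helper name `mix_augment` are illustrative assumptions, not the paper's exact recipe.

```python
import torch

def mix_augment(wav_a: torch.Tensor, wav_b: torch.Tensor,
                text_a: str, text_b: str,
                snr_db_range=(-5.0, 5.0)):
    """Mix two audio clips at a random SNR and concatenate their captions.

    The SNR range and caption joining are illustrative assumptions.
    """
    # Draw a random signal-to-noise ratio (in dB) for this mixture.
    snr_db = torch.empty(1).uniform_(*snr_db_range).item()
    # Scale the second clip so the pair matches the sampled SNR.
    power_a = wav_a.pow(2).mean().clamp_min(1e-8)
    power_b = wav_b.pow(2).mean().clamp_min(1e-8)
    gain_b = torch.sqrt(power_a / (power_b * 10 ** (snr_db / 10)))
    # Trim both clips to a common length before summing.
    n = min(wav_a.shape[-1], wav_b.shape[-1])
    mixed = wav_a[..., :n] + gain_b * wav_b[..., :n]
    # Join the captions so the condition reflects both sources.
    caption = f"{text_a} and {text_b}"
    return mixed, caption
```

Likewise, classifier-free guidance can be sketched as a simple interpolation between conditional and unconditional next-token logits; here `model`, `text_emb`, and `null_emb` are placeholders for the language model, the text conditioning, and the "empty" condition, and the guidance scale is an illustrative value.

```python
import torch

@torch.no_grad()
def cfg_logits(model, tokens, text_emb, null_emb, guidance_scale: float = 3.0):
    """Classifier-free guidance over next-token logits (sketch, assumed interfaces)."""
    logits_cond = model(tokens, text_emb)      # text-conditioned prediction
    logits_uncond = model(tokens, null_emb)    # unconditional prediction
    # Push the distribution toward the text-conditioned one.
    return logits_uncond + guidance_scale * (logits_cond - logits_uncond)
```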