We introduce AudioLM, a framework for high-quality audio generation with long-term consistency. AudioLM maps the input audio to a sequence of discrete tokens and casts audio generation as a language modeling task in this representation space. We show how existing audio tokenizers provide different trade-offs between reconstruction quality and long-term structure, and we propose a hybrid tokenization scheme to achieve both objectives. Namely, we leverage the discretized activations of a masked language model pre-trained on audio to capture long-term structure and the discrete codes produced by a neural audio codec to achieve high-quality synthesis. By training on large corpora of raw audio waveforms, AudioLM learns to generate natural and coherent continuations given short prompts. When trained on speech, and without any transcript or annotation, AudioLM generates syntactically and semantically plausible speech continuations while also maintaining speaker identity and prosody for unseen speakers. Furthermore, we demonstrate how our approach extends beyond speech by generating coherent piano music continuations, despite being trained without any symbolic representation of music.
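To make the hybrid tokenization concrete, the sketch below illustrates the idea in Python. It is a minimal illustration, not the paper's implementation: `semantic_encoder` and `codec_encode` are hypothetical stand-ins for the two pre-trained models (the paper instantiates these with a w2v-BERT masked language model and the SoundStream neural codec), and the codebook, frame rates, and shapes are all assumed for illustration.

```python
import numpy as np

def kmeans_quantize(features, codebook):
    """Map each frame embedding to the index of its nearest codebook centroid.

    AudioLM derives its "semantic tokens" this way: activations of a masked
    language model pre-trained on audio are discretized with k-means.
    """
    # (T, 1, D) - (1, K, D) -> (T, K) squared distances, argmin over centroids
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)  # (T,) semantic token ids

# --- Hypothetical stand-ins for the two pre-trained models -----------------
def semantic_encoder(waveform):
    """Stand-in for a masked-LM audio encoder (w2v-BERT in the paper):
    one D-dim embedding per frame. Random values for illustration only."""
    rng = np.random.default_rng(0)
    n_frames = len(waveform) // 320           # assume 16 kHz audio, 20 ms hop
    return rng.normal(size=(n_frames, 64))    # (T_sem, D)

def codec_encode(waveform):
    """Stand-in for a neural codec encoder (SoundStream in the paper):
    residual-VQ code indices, several codebooks per frame."""
    rng = np.random.default_rng(1)
    n_frames = len(waveform) // 320
    return rng.integers(0, 1024, size=(n_frames, 4))  # (T_ac, n_quantizers)

# --- Hybrid tokenization: one discrete sequence for language modeling ------
waveform = np.zeros(16000)                               # 1 s of (silent) audio
codebook = np.random.default_rng(2).normal(size=(512, 64))  # k-means centroids

semantic_tokens = kmeans_quantize(semantic_encoder(waveform), codebook)
acoustic_tokens = codec_encode(waveform).reshape(-1)

# AudioLM then trains autoregressive models on such discrete sequences,
# predicting acoustic tokens conditioned on semantic tokens, so that the
# semantic stream carries long-term structure and the acoustic stream
# carries the detail needed for high-quality synthesis.
tokens = np.concatenate([semantic_tokens, acoustic_tokens])
print(tokens.shape, tokens[:8])
```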