Models for audio generation are typically trained on hours of recordings. Here, we illustrate that capturing the essence of an audio source is often possible from as little as a few tens of seconds of a single training signal. Specifically, we present a GAN-based generative model that can be trained on one short audio signal from any domain (e.g. speech, music, etc.) and does not require pre-training or any other form of external supervision. Once trained, our model can generate random samples of arbitrary duration that maintain semantic similarity to the training waveform, yet exhibit new compositions of its audio primitives. This enables a long line of interesting applications, including generating new jazz improvisations or new a-cappella rap variants based on a single short example, producing coherent modifications to famous songs (e.g. adding a new verse to a Beatles song based solely on the original recording), filling in missing parts (inpainting), extending the bandwidth of a speech signal (super-resolution), and enhancing old recordings without access to any clean training example. We show that in all these cases, no more than 20 seconds of training audio suffice for our model to achieve state-of-the-art results, despite its complete lack of prior knowledge about the nature of audio signals in general.