Generative models have made significant progress in synthesizing high-fidelity audio from short textual descriptions. However, editing existing audio using natural language has remained largely underexplored. Current approaches either require a complete description of the edited audio or are constrained to predefined edit instructions that lack flexibility. In this work, we introduce SAO-Instruct, a model based on Stable Audio Open capable of editing audio clips using any free-form natural language instruction. To train our model, we create a dataset of audio editing triplets (input audio, edit instruction, output audio) using Prompt-to-Prompt, DDPM inversion, and a manual editing pipeline. Although partially trained on synthetic data, our model generalizes well to real in-the-wild audio clips and unseen edit instructions. We demonstrate that SAO-Instruct achieves competitive performance on objective metrics and outperforms other audio editing approaches in a subjective listening study. To encourage future research, we release our code and model weights.