Sound effect editing, i.e., modifying audio by adding, removing, or replacing elements, remains constrained by existing approaches that rely solely on low-level signal processing or coarse text prompts, which often yields limited flexibility and suboptimal audio quality. To address this, we propose AV-Edit, a generative sound effect editing framework that enables fine-grained editing of the existing audio tracks in videos by jointly leveraging visual, audio, and text semantics. Specifically, the proposed method employs a specially designed contrastive audio-visual masked autoencoder (CAV-MAE-Edit) for multimodal pre-training, learning aligned cross-modal representations. These representations are then used to train an editing-oriented Multimodal Diffusion Transformer (MM-DiT) that, via a correlation-based feature gating training strategy, removes visually irrelevant sounds and generates missing audio elements consistent with the video content. Furthermore, we construct a dedicated video-based sound editing dataset as an evaluation benchmark. Experiments demonstrate that the proposed AV-Edit generates high-quality audio with precise, visually grounded modifications, achieving state-of-the-art performance in sound effect editing and remaining strongly competitive in audio generation.
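The abstract names a correlation-based feature gating strategy but does not specify its form. Below is a minimal PyTorch sketch of one plausible reading, in which each audio token is softly gated by its cosine correlation with the visual tokens, so that visually irrelevant sounds are attenuated before the denoiser. The function name gate_audio_tokens and the temperature tau are illustrative assumptions, not the paper's implementation.

import torch
import torch.nn.functional as F

def gate_audio_tokens(audio_feats: torch.Tensor,
                      visual_feats: torch.Tensor,
                      tau: float = 0.1) -> torch.Tensor:
    """Hypothetical correlation-based gating sketch.

    audio_feats:  (B, Ta, D) audio token features.
    visual_feats: (B, Tv, D) visual token features, assumed to share the
                  embedding space with audio (e.g. from a CAV-MAE-style
                  aligned encoder).
    Returns gated audio features of shape (B, Ta, D).
    """
    a = F.normalize(audio_feats, dim=-1)             # (B, Ta, D)
    v = F.normalize(visual_feats, dim=-1)            # (B, Tv, D)
    # Cross-modal cosine similarity between every audio and visual token.
    sim = torch.einsum("bad,bvd->bav", a, v)         # (B, Ta, Tv)
    # An audio token's correlation with the video = its best visual match.
    corr = sim.max(dim=-1).values                    # (B, Ta)
    # Soft gate in (0, 1): close to 0 for visually irrelevant sounds.
    gate = torch.sigmoid(corr / tau).unsqueeze(-1)   # (B, Ta, 1)
    return audio_feats * gate

# Toy usage with random features standing in for encoder outputs.
audio = torch.randn(2, 50, 256)
video = torch.randn(2, 16, 256)
print(gate_audio_tokens(audio, video).shape)         # torch.Size([2, 50, 256])

Under this reading, the max over visual tokens makes the gate permissive (a sound only needs one matching visual region to survive), while the sigmoid keeps the operation differentiable for end-to-end training.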