AV-Edit: Multimodal Generative Sound Effect Editing via Audio-Visual Semantic Joint Control

Sound effect editing (modifying audio by adding, removing, or replacing elements) remains constrained by existing approaches that rely solely on low-level signal processing or coarse text prompts, which often results in limited flexibility and suboptimal audio quality. To address this, we propose AV-Edit, a generative sound effect editing framework that enables fine-grained editing of existing audio tracks in videos by jointly leveraging visual, audio, and text semantics. Specifically, the proposed method employs a specially designed contrastive audio-visual masking autoencoder (CAV-MAE-Edit) for multimodal pre-training, which learns aligned cross-modal representations. These representations are then used to train an editorial Multimodal Diffusion Transformer (MM-DiT) with a correlation-based feature gating training strategy, so that the model can remove visually irrelevant sounds and generate missing audio elements consistent with the video content. Furthermore, we construct a dedicated video-based sound editing dataset as an evaluation benchmark. Experiments demonstrate that AV-Edit generates high-quality audio with precise modifications based on visual content, achieving state-of-the-art performance in sound effect editing and competitive performance in audio generation.

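
The abstract states that CAV-MAE-Edit learns aligned cross-modal representations through contrastive pre-training, but it does not give the objective. As a rough illustration only, the sketch below shows a generic symmetric InfoNCE-style contrastive loss between paired audio and video embeddings, a common way to align two modalities. The function names, the temperature value, and the use of one pooled embedding per clip are all assumptions made for this example; this is not the loss defined in the paper.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(audio_emb, video_emb, temperature=0.07):
    """Generic symmetric InfoNCE loss (illustrative, not the paper's objective).

    audio_emb, video_emb: arrays of shape (batch, dim); row i of each array is
    assumed to come from the same clip. The temperature is an assumed value.
    """
    a = l2_normalize(audio_emb)
    v = l2_normalize(video_emb)
    logits = a @ v.T / temperature          # (batch, batch) similarity matrix
    idx = np.arange(a.shape[0])             # matching pairs lie on the diagonal
    loss_a2v = -log_softmax(logits, axis=1)[idx, idx].mean()  # audio -> video
    loss_v2a = -log_softmax(logits, axis=0)[idx, idx].mean()  # video -> audio
    return 0.5 * (loss_a2v + loss_v2a)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    audio = rng.normal(size=(8, 16))
    video = audio + 0.01 * rng.normal(size=(8, 16))   # nearly aligned pairs
    print("aligned loss:   ", symmetric_contrastive_loss(audio, video))
    print("misaligned loss:", symmetric_contrastive_loss(audio, np.roll(video, 1, axis=0)))
```

In this toy setup the loss is low when each audio embedding is closest to its own video embedding and high when the pairing is shifted, which is the behavior a contrastive alignment objective is meant to reward.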