A Closer Look at Audio-Visual Semantic Segmentation

Audio-visual segmentation (AVS) is a complex task that involves accurately segmenting the corresponding sounding object based on audio-visual queries. Successful audio-visual learning requires two essential components: 1) an unbiased dataset with high-quality pixel-level multi-class labels, and 2) a model capable of effectively linking audio information with its corresponding visual object. However, current methods only partially address these two requirements: training sets contain biased audio-visual data, and models generalise poorly beyond this biased training set. In this work, we propose a new strategy to build cost-effective and relatively unbiased audio-visual semantic segmentation benchmarks. Our strategy, called Visual Post-production (VPO), exploits the observation that explicit audio-visual pairs extracted from single video sources are not necessary to build such benchmarks. We also refine the previously proposed AVSBench to transform it into the audio-visual semantic segmentation benchmark AVSBench-Single+. Furthermore, this paper introduces a new pixel-wise audio-visual contrastive learning method to enable better generalisation of the model beyond the training set. We verify the validity of the VPO strategy by showing that state-of-the-art (SOTA) models reach almost the same accuracy whether they are trained on datasets built by matching audio and visual data from different sources or on datasets whose audio and visual data come from the same video source. Then, using the proposed VPO benchmarks and AVSBench-Single+, we show that our method produces more accurate audio-visual semantic segmentation than SOTA models. Code and dataset will be made available.
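To make the VPO idea concrete, the following is a minimal sketch of how audio and visual data from different sources might be matched. The abstract does not specify the matching rule, so pairing by a shared class label is an assumption here, as are the function name `build_vpo_pairs`, the `(id, class_label)` tuple format, and the simplification that each image carries a single class label. It is an illustration only, not the authors' implementation.

```python
import random
from collections import defaultdict


def build_vpo_pairs(images, audio_clips, seed=0):
    """Pair each labelled image with an audio clip of the same class.

    images:      list of (image_id, class_label) tuples (hypothetical format)
    audio_clips: list of (audio_id, class_label) tuples (hypothetical format)
    Returns a list of (image_id, audio_id, class_label) triples.
    """
    rng = random.Random(seed)  # seeded so the pairing is reproducible

    # Index the audio pool by class label.
    audio_by_class = defaultdict(list)
    for audio_id, label in audio_clips:
        audio_by_class[label].append(audio_id)

    pairs = []
    for image_id, label in images:
        candidates = audio_by_class.get(label)
        if not candidates:
            continue  # no audio available for this class, so skip the image
        # Assumption: any clip of the same class is an acceptable match.
        pairs.append((image_id, rng.choice(candidates), label))
    return pairs


# Toy data: the images and audio clips come from unrelated collections.
images = [("img_001", "dog"), ("img_002", "guitar"), ("img_003", "zebra")]
audio_clips = [("aud_a", "dog"), ("aud_b", "dog"), ("aud_c", "guitar")]
pairs = build_vpo_pairs(images, audio_clips)
```

In this toy run, `img_003` is dropped because the audio pool has no "zebra" clip, while the other two images each receive a clip of their own class. The pixel-level masks would stay attached to the image, so no single video source has to supply both modalities.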
