We propose a new problem called audio-visual segmentation (AVS), in which the goal is to output a pixel-level map of the object(s) that produce sound at the time of the image frame. To facilitate this research, we construct the first audio-visual segmentation benchmark, i.e., AVSBench, providing pixel-wise annotations for sounding objects in audible videos. It contains three subsets: AVSBench-object (Single-source subset, Multi-sources subset) and AVSBench-semantic (Semantic-labels subset). Accordingly, three settings are studied: 1) semi-supervised audio-visual segmentation with a single sound source; 2) fully-supervised audio-visual segmentation with multiple sound sources; and 3) fully-supervised audio-visual semantic segmentation. The first two settings require generating binary masks of sounding objects, indicating the pixels that correspond to the audio, while the third setting further requires generating semantic maps that indicate the object category. To deal with these problems, we propose a new baseline method that uses a temporal pixel-wise audio-visual interaction module to inject audio semantics as guidance for the visual segmentation process. We also design a regularization loss to encourage audio-visual mapping during training. Quantitative and qualitative experiments on AVSBench compare our approach to several existing methods for related tasks, demonstrating that the proposed method is promising for building a bridge between the audio and pixel-wise visual semantics. Code is available at https://github.com/OpenNLPLab/AVSBench. Online benchmark is available at http://www.avlbench.opennlplab.cn.
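For intuition, below is a minimal, hypothetical PyTorch sketch of a pixel-wise audio-visual attention step of the kind described above: every visual pixel in a clip attends to per-frame audio embeddings so that audio semantics guide the visual features before segmentation. The class name, feature dimensions, and attention layout are illustrative assumptions, not the paper's actual TPAVI module (see the repository linked above for the real implementation).

```python
import torch
import torch.nn as nn


class AudioVisualInteraction(nn.Module):
    """Hypothetical sketch of pixel-wise audio-visual attention (not the paper's exact module).

    Each visual pixel over the clip attends to the per-frame audio embeddings,
    injecting audio semantics into the visual feature map before segmentation.
    """

    def __init__(self, visual_dim=256, audio_dim=128):
        super().__init__()
        self.to_q = nn.Linear(visual_dim, visual_dim)  # pixel-wise queries from visual features
        self.to_k = nn.Linear(audio_dim, visual_dim)   # keys from audio embeddings
        self.to_v = nn.Linear(audio_dim, visual_dim)   # values from audio embeddings
        self.scale = visual_dim ** -0.5

    def forward(self, visual, audio):
        # visual: (B, T, H, W, C) per-frame feature maps; audio: (B, T, D) per-frame audio embeddings
        B, T, H, W, C = visual.shape
        q = self.to_q(visual).reshape(B, T * H * W, C)                     # all pixels in the clip
        k = self.to_k(audio)                                               # (B, T, C)
        v = self.to_v(audio)                                               # (B, T, C)
        attn = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)   # (B, T*H*W, T)
        fused = (attn @ v).reshape(B, T, H, W, C)                          # audio-conditioned features
        return visual + fused  # residual connection keeps the original visual signal


if __name__ == "__main__":
    module = AudioVisualInteraction()
    vis = torch.randn(2, 5, 28, 28, 256)   # 5 frames of 28x28 visual feature maps
    aud = torch.randn(2, 5, 128)           # one audio embedding per frame
    out = module(vis, aud)                 # same shape as vis, now audio-guided
    print(out.shape)
```

The output keeps the visual feature shape, so a standard segmentation decoder can consume it unchanged; a regularization loss of the kind mentioned above would additionally constrain the fused visual features to align with the audio embeddings during training.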