We propose to explore a new problem called audio-visual segmentation (AVS), in which the goal is to output a pixel-level map of the object(s) that produce sound at the time of the image frame. To facilitate this research, we construct the first audio-visual segmentation benchmark (AVSBench), providing pixel-wise annotations for the sounding objects in audible videos. Two settings are studied with this benchmark: 1) semi-supervised audio-visual segmentation with a single sound source and 2) fully supervised audio-visual segmentation with multiple sound sources. To address the AVS problem, we propose a novel method that uses a temporal pixel-wise audio-visual interaction module to inject audio semantics as guidance for the visual segmentation process. We also design a regularization loss to encourage audio-visual mapping during training. Quantitative and qualitative experiments on AVSBench compare our approach to several existing methods from related tasks, demonstrating that the proposed method is a promising step toward bridging audio and pixel-wise visual semantics. Code is available at https://github.com/OpenNLPLab/AVSBench.
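To make the core idea concrete, the sketch below illustrates one plausible form of pixel-wise audio-visual interaction: a frame-level audio embedding is projected into the visual feature space and used to gate each spatial location before segmentation. This is a minimal illustration only; the module name, feature dimensions, and gating mechanism are assumptions for exposition and are not taken from the authors' released implementation.

```python
# Hypothetical sketch of injecting frame-level audio semantics as pixel-wise
# guidance for visual features (PyTorch). Shapes and names are assumptions.
import torch
import torch.nn as nn

class AudioVisualInteraction(nn.Module):
    def __init__(self, vis_dim=256, aud_dim=128):
        super().__init__()
        self.audio_proj = nn.Linear(aud_dim, vis_dim)  # map audio to visual channel dim
        self.query = nn.Conv2d(vis_dim, vis_dim, 1)    # pixel-wise queries from vision
        self.out = nn.Conv2d(vis_dim, vis_dim, 1)

    def forward(self, vis_feat, aud_feat):
        # vis_feat: (B, T, C, H, W) visual features per frame
        # aud_feat: (B, T, D) one audio embedding per frame
        B, T, C, H, W = vis_feat.shape
        v = vis_feat.reshape(B * T, C, H, W)
        a = self.audio_proj(aud_feat).reshape(B * T, C, 1, 1)
        q = self.query(v)
        # similarity between each pixel and the frame-level audio vector
        attn = torch.sigmoid((q * a).sum(dim=1, keepdim=True))  # (B*T, 1, H, W)
        fused = v + self.out(v * attn)                           # audio-gated residual
        return fused.reshape(B, T, C, H, W)

# usage
module = AudioVisualInteraction()
vis = torch.randn(2, 5, 256, 28, 28)  # 2 clips, 5 frames each
aud = torch.randn(2, 5, 128)
out = module(vis, aud)                 # (2, 5, 256, 28, 28), fed to a mask decoder
```

The audio-modulated features would then be decoded into per-frame segmentation masks; the regularization loss mentioned in the abstract would additionally constrain the correspondence between the audio embedding and the masked visual features during training.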