Learning to localize and separate individual object sounds in the audio channel of a video is a difficult task. Current state-of-the-art methods predict audio masks from artificially mixed spectrograms, following the Mix-and-Separate framework. We propose audio-visual co-segmentation, in which the network learns both what individual objects look like and what they sound like, from videos labeled only with object labels. Unlike other recent visually guided audio source separation frameworks, our architecture can be learned end-to-end and requires no additional supervision or bounding-box proposals. Specifically, we introduce weakly-supervised object segmentation in the context of sound separation. We also formulate spectrogram mask prediction using a set of learned mask bases, which are combined using coefficients conditioned on the output of the object segmentation, a design that facilitates separation. Extensive experiments on the MUSIC dataset show that our proposed approach outperforms state-of-the-art methods on visually guided sound source separation and sound denoising.
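To make the Mix-and-Separate idea concrete, the sketch below (an illustrative assumption, not taken from the paper) shows how such training pairs are typically built: two single-source magnitude spectrograms are summed into an artificial mixture, and their ideal ratio masks serve as supervision for the mask predictor.

```python
# Minimal sketch of Mix-and-Separate style target construction (assumed).
import torch

def mix_and_separate_targets(spec_a, spec_b, eps=1e-8):
    """spec_a, spec_b: (F, T) magnitude spectrograms of two solo recordings."""
    mixture = spec_a + spec_b
    # Ideal ratio masks: each source's share of energy at every time-frequency bin.
    mask_a = spec_a / (mixture + eps)
    mask_b = spec_b / (mixture + eps)
    return mixture, mask_a, mask_b

# Example with random tensors standing in for two solo instrument clips.
spec_a = torch.rand(256, 256)
spec_b = torch.rand(256, 256)
mixture, mask_a, mask_b = mix_and_separate_targets(spec_a, spec_b)
# A separation network takes `mixture` (plus visual features) and is trained
# to predict `mask_a` / `mask_b`; applying a predicted mask to `mixture`
# recovers an estimate of the corresponding source.
```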
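The following sketch illustrates, under assumed tensor shapes and layer sizes (not the authors' exact architecture), how a spectrogram mask can be predicted as a weighted combination of learned mask bases, with coefficients conditioned on pooled object-segmentation features.

```python
# Hypothetical mask-basis head: combine K learned mask bases using
# coefficients predicted from object-segmentation features (a sketch).
import torch
import torch.nn as nn

class MaskBasisHead(nn.Module):
    def __init__(self, num_bases=16, seg_feat_dim=512, freq_bins=256, time_frames=256):
        super().__init__()
        # K learnable mask bases, each a full-resolution spectrogram mask.
        self.bases = nn.Parameter(torch.randn(num_bases, freq_bins, time_frames) * 0.01)
        # Map pooled segmentation features to K combination coefficients.
        self.coeff_net = nn.Linear(seg_feat_dim, num_bases)

    def forward(self, seg_features):
        # seg_features: (B, seg_feat_dim), e.g. globally pooled segmentation features.
        coeffs = torch.softmax(self.coeff_net(seg_features), dim=-1)   # (B, K)
        # Weighted sum of bases, squashed to [0, 1] to form a ratio mask.
        mask = torch.einsum('bk,kft->bft', coeffs, self.bases)         # (B, F, T)
        return torch.sigmoid(mask)

# Usage: multiply the predicted mask with the mixture spectrogram to
# estimate the spectrogram of the selected object's sound.
head = MaskBasisHead()
seg_features = torch.randn(4, 512)        # pooled features for 4 video clips
mixture_spec = torch.rand(4, 256, 256)    # magnitude spectrograms of the mixtures
separated_spec = head(seg_features) * mixture_spec
```

Conditioning only the low-dimensional coefficients, rather than the full mask, on the visual stream keeps the mask space shared across objects, which is one plausible reading of why this design facilitates separation.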