Discriminatively localizing sounding objects in cocktail-party scenarios, i.e., mixed sound scenes, is commonplace for humans but still challenging for machines. In this paper, we propose a two-stage learning framework to perform self-supervised class-aware sounding object localization. First, we learn robust object representations by aggregating candidate sound localization results in single-source scenes. Then, class-aware object localization maps are generated in cocktail-party scenarios by referring to the pre-learned object knowledge, and the sounding objects are selected by matching the audio and visual object category distributions, where audiovisual consistency serves as the self-supervised signal. Experimental results on both realistic and synthesized cocktail-party videos demonstrate that our model is superior in filtering out silent objects and localizing sounding objects of different classes. Code is available at https://github.com/DTaoo/Discriminative-Sounding-Objects-Localization.
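To make the second-stage self-supervision concrete, below is a minimal PyTorch sketch (not the authors' released code) of the distribution-matching idea: class-aware localization maps are pooled over candidate sounding regions to form a visual category distribution, which is aligned with the category distribution predicted from the mixed audio. The tensor shapes, the use of a KL divergence, and all names (`category_distribution_loss`, `sounding_masks`, etc.) are illustrative assumptions, not the paper's exact formulation.

```python
# A minimal sketch of the stage-two self-supervised objective: sounding objects
# are selected by matching audio and visual object category distributions, with
# their agreement (here, a KL divergence) used as the training signal.
# Names and shapes are assumptions for illustration, not the authors' code.

import torch
import torch.nn.functional as F

def category_distribution_loss(audio_logits, visual_class_maps, sounding_masks):
    """
    audio_logits:      (B, C)        per-class scores predicted from the mixed audio
    visual_class_maps: (B, C, H, W)  class-aware object localization maps
    sounding_masks:    (B, 1, H, W)  soft masks over candidate sounding regions
    """
    # Visual category distribution: pool the class-aware maps over the
    # candidate sounding regions, suppressing silent objects.
    weighted = visual_class_maps * sounding_masks
    visual_scores = weighted.flatten(2).mean(dim=2)   # (B, C) spatial pooling
    visual_dist = F.softmax(visual_scores, dim=1)

    # Audio category distribution from the mixed-sound branch.
    audio_log_dist = F.log_softmax(audio_logits, dim=1)

    # Audiovisual consistency: the two category distributions should match.
    return F.kl_div(audio_log_dist, visual_dist, reduction="batchmean")

if __name__ == "__main__":
    # Example usage with random tensors standing in for network outputs.
    B, C, H, W = 4, 10, 14, 14
    loss = category_distribution_loss(
        torch.randn(B, C),
        torch.rand(B, C, H, W),
        torch.rand(B, 1, H, W),
    )
    print(loss.item())
```

Minimizing such a consistency loss requires no category labels: the mixed audio and the visual scene supervise each other, which is what lets the framework discriminate sounding from silent objects without manual annotation.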