Sound localization aims to find the source of an audio signal in a visual scene. However, annotating the correspondence between signals sampled from the audio and visual modalities is labor-intensive, which makes it difficult to supervise learning for this task. In this work, we propose an iterative contrastive learning framework that requires no data annotations. At each iteration, the proposed method uses as pseudo-labels 1) the localization results predicted for the images in the previous iteration, and 2) the semantic relationships inferred from the audio signals. We then use the pseudo-labels to learn the correlation between visual and audio signals sampled from the same video (intra-frame sampling) as well as the association between those extracted across videos (inter-frame relation). This iterative strategy gradually encourages the localization of sounding objects and reduces the correlation between non-sounding regions and the reference audio. Quantitative and qualitative experimental results demonstrate that the proposed framework performs favorably against existing unsupervised and weakly-supervised methods on the sound localization task.
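The pseudo-label step for the intra-frame case can be illustrated with a minimal sketch. The abstract does not specify the loss, the thresholding rule, or any function names, so everything below is an assumption: an InfoNCE-style objective over spatial locations, a min-max threshold for turning the previous iteration's map into a mask, and the hypothetical helpers `localization_map`, `pseudo_mask`, and `contrastive_loss`. The inter-frame (cross-video) term is omitted.

```python
import numpy as np


def localization_map(visual, audio):
    """Cosine similarity between one audio embedding and every spatial location.

    visual: (H, W, D) feature map, audio: (D,) embedding -> (H, W) map in [-1, 1].
    """
    v = visual / (np.linalg.norm(visual, axis=-1, keepdims=True) + 1e-8)
    a = audio / (np.linalg.norm(audio) + 1e-8)
    return v @ a


def pseudo_mask(prev_map, threshold=0.5):
    """Binarize the previous iteration's map into a pseudo-label.

    Assumed rule (not from the paper): min-max normalize, then threshold.
    """
    lo, hi = prev_map.min(), prev_map.max()
    norm = (prev_map - lo) / (hi - lo + 1e-8)
    return (norm >= threshold).astype(np.float64)


def contrastive_loss(visual, audio, mask, temperature=0.07):
    """InfoNCE-style loss (assumed form): masked regions act as positives,
    the remaining regions as negatives for the reference audio."""
    sim = localization_map(visual, audio) / temperature
    pos = np.sum(mask * np.exp(sim))
    neg = np.sum((1.0 - mask) * np.exp(sim))
    return -np.log(pos / (pos + neg) + 1e-8)
```

In an iterative loop, the map produced after one round of training would be passed through `pseudo_mask` and reused as the label for the next round. Minimizing the loss raises the similarity inside the mask and lowers it elsewhere, which mirrors the stated goal of reducing the correlation between non-sounding regions and the reference audio.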