Contrastive learning has delivered impressive results in many audio-visual representation learning scenarios. However, existing approaches optimize for learning either \textit{global} representations useful for tasks such as classification, or \textit{local} representations useful for tasks such as audio-visual source localization and separation. While they produce satisfactory results in their intended downstream scenarios, they often fail to generalize to tasks they were not originally designed for. In this work, we propose a versatile self-supervised approach to learn audio-visual representations that generalize both to tasks requiring global semantic information (e.g., classification) and to tasks requiring fine-grained spatio-temporal information (e.g., localization). We achieve this by optimizing two cross-modal contrastive objectives that together encourage our model to learn discriminative global-local visual information given audio signals. To show that our approach learns generalizable video representations, we evaluate it on various downstream scenarios including action/sound classification, lip reading, deepfake detection, and sound source localization.
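For concreteness, a minimal sketch of what such a pair of global and local cross-modal objectives could look like, assuming a standard InfoNCE formulation; the symbols below ($v$, $a$, $v_t$, $a_t$, $\tau$, $\lambda$) are illustrative assumptions and not necessarily the exact losses used in this work:
\[
\mathcal{L}_{\text{global}} = -\log \frac{\exp\!\big(\mathrm{sim}(v, a)/\tau\big)}{\sum_{a' \in \mathcal{A}} \exp\!\big(\mathrm{sim}(v, a')/\tau\big)},
\qquad
\mathcal{L}_{\text{local}} = -\frac{1}{T}\sum_{t=1}^{T}\log \frac{\exp\!\big(\mathrm{sim}(v_t, a_t)/\tau\big)}{\sum_{t'=1}^{T} \exp\!\big(\mathrm{sim}(v_t, a_{t'})/\tau\big)},
\]
where $v$ and $a$ denote clip-level video and audio embeddings, $v_t$ and $a_t$ denote per-timestep (or per-region) features, $\mathcal{A}$ is a set of negative audio samples, $\tau$ is a temperature, and the two terms could be combined as $\mathcal{L} = \mathcal{L}_{\text{global}} + \lambda\, \mathcal{L}_{\text{local}}$ with a weighting hyperparameter $\lambda$. The global term aligns whole-clip representations across modalities, while the local term enforces fine-grained temporal/spatial correspondence between the streams.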