Contrastive learning has delivered impressive results for various tasks in the self-supervised regime. However, existing approaches optimize for learning representations specific to downstream scenarios, i.e., \textit{global} representations suitable for tasks such as classification, or \textit{local} representations suitable for tasks such as detection and localization. While they produce satisfactory results in the intended downstream scenarios, they often fail to generalize to tasks they were not originally designed for. In this work, we propose to learn video representations that generalize both to tasks that require global semantic information (e.g., classification) and to tasks that require local fine-grained spatio-temporal information (e.g., localization). We achieve this by optimizing two contrastive objectives that together encourage our model to learn global-local visual information given audio signals. We show that the two objectives mutually improve the generalizability of the learned global-local representations, significantly outperforming their disjointly learned counterparts. We demonstrate our approach on a variety of tasks, including action/sound classification, lip reading, deepfake detection, and event/sound localization (https://github.com/yunyikristy/global\_local).
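To make the formulation concrete, the following is a minimal sketch of the kind of loss pair described above, assuming standard InfoNCE-style audio-visual objectives; the symbols ($v^g$ for a pooled global video embedding, $v^l$ for a local spatio-temporal feature, $a$ for the paired audio embedding, $\mathcal{N}$ for a set of negative audio samples, $\tau$ for a temperature) are illustrative notation rather than the paper's own:
\begin{equation*}
\mathcal{L}_{\text{global}} = -\log \frac{\exp\!\big(\mathrm{sim}(v^g, a)/\tau\big)}{\sum_{a' \in \{a\} \cup \mathcal{N}} \exp\!\big(\mathrm{sim}(v^g, a')/\tau\big)},
\qquad
\mathcal{L}_{\text{local}} = -\log \frac{\exp\!\big(\mathrm{sim}(v^l, a)/\tau\big)}{\sum_{a' \in \{a\} \cup \mathcal{N}} \exp\!\big(\mathrm{sim}(v^l, a')/\tau\big)},
\end{equation*}
with the two terms optimized jointly, e.g., $\mathcal{L} = \mathcal{L}_{\text{global}} + \mathcal{L}_{\text{local}}$, so that gradients from both the global and the local objective shape a shared backbone.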