Multimodal self-supervised learning is attracting increasing attention because it not only allows training large networks without human supervision but also enables searching and retrieving data across modalities. In this context, this paper proposes a self-supervised training framework that learns a common multimodal embedding space that, in addition to sharing representations across different modalities, enforces a grouping of semantically similar instances. To this end, we extend the concept of instance-level contrastive learning with a multimodal clustering step in the training pipeline to capture semantic similarities across modalities. The resulting embedding space enables retrieval of samples across all modalities, even from unseen datasets and different domains. To evaluate our approach, we train our model on the HowTo100M dataset and evaluate its zero-shot retrieval capabilities in two challenging domains, namely text-to-video retrieval and temporal action localization, showing state-of-the-art results on four different datasets.
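To make the training objective concrete, the following is a minimal sketch of how instance-level cross-modal contrastive learning can be combined with a multimodal clustering term. It is an illustrative assumption, not the paper's implementation: the function names, the hard-assignment clustering loss, and the source of the shared centroids (e.g., periodic k-means over a memory bank, not shown) are all hypothetical choices made for this example.

```python
# Illustrative sketch only: pairwise cross-modal InfoNCE plus a clustering term
# that pulls embeddings from all modalities toward K shared centroids.
import torch
import torch.nn.functional as F

def contrastive_loss(za, zb, temperature=0.07):
    """Symmetric InfoNCE between two batches of paired modality embeddings."""
    za, zb = F.normalize(za, dim=1), F.normalize(zb, dim=1)
    logits = za @ zb.t() / temperature           # (B, B) similarity matrix
    targets = torch.arange(za.size(0))           # positives lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def clustering_loss(z, centroids, temperature=0.07):
    """Sharpen embeddings toward their nearest shared multimodal centroid
    (hard pseudo-labels; a simplification assumed for this sketch)."""
    z = F.normalize(z, dim=1)
    c = F.normalize(centroids, dim=1)
    sims = z @ c.t() / temperature               # (B, K) cluster similarities
    assign = sims.argmax(dim=1)                  # no gradient through argmax
    return F.cross_entropy(sims, assign)

# Toy usage: B paired samples, d-dim embeddings, K shared clusters.
B, d, K = 32, 256, 64
video, text, audio = (torch.randn(B, d) for _ in range(3))
centroids = torch.randn(K, d)                    # hypothetically from k-means

loss = (contrastive_loss(video, text) +
        contrastive_loss(video, audio) +
        contrastive_loss(text, audio) +
        sum(clustering_loss(z, centroids) for z in (video, text, audio)))
```

In this sketch, the pairwise contrastive terms align corresponding instances across modalities, while the clustering term groups semantically similar instances around centroids shared by all modalities, matching the two goals stated above.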