Recent advances in representation learning have demonstrated the ability to represent information from different modalities such as video, text, and audio in a single high-level embedding vector. In this work we present a self-supervised learning framework that learns a representation capturing finer levels of granularity across modalities, such as concepts or events represented by visual objects or spoken words. Our framework relies on a discretized embedding space, created via vector quantization, that is shared across modalities. Beyond the shared embedding space, we propose a Cross-Modal Code Matching objective that forces the representations from different views (modalities) to have a similar distribution over the discrete embedding space, such that cross-modal object/action localization can be performed without direct supervision. In our experiments we show that the proposed discretized multi-modal fine-grained representation (e.g., pixel/word/frame) can complement high-level summary representations (e.g., video/sentence/waveform) for improved performance on cross-modal retrieval tasks. We also observe that the discretized representation uses individual clusters to represent the same semantic concept across modalities.
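To make the two core ingredients concrete, the sketch below illustrates, under stated assumptions, (i) a vector-quantized codebook shared by all modalities and (ii) a Cross-Modal Code Matching term that aligns each paired sample's distribution over the shared codes. This is a minimal PyTorch-style illustration, not the paper's exact implementation: the codebook size, temperature, straight-through quantization, and the use of a symmetric KL divergence are assumptions made for clarity.

```python
import torch
import torch.nn.functional as F
from torch import nn


class SharedVQCodebook(nn.Module):
    """Vector-quantized codebook shared across modalities (illustrative sketch)."""

    def __init__(self, num_codes: int = 1024, dim: int = 256, temperature: float = 0.1):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)  # one codebook, used by every modality
        self.temperature = temperature

    def _distances(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, seq_len, dim) fine-grained embeddings (e.g., frame- or word-level).
        codes = self.codebook.weight.unsqueeze(0).expand(features.size(0), -1, -1)
        return torch.cdist(features, codes)  # (batch, seq_len, num_codes)

    def code_distribution(self, features: torch.Tensor) -> torch.Tensor:
        # Soft assignment of each local feature to the shared codes, averaged per sample.
        assign = F.softmax(-self._distances(features) / self.temperature, dim=-1)
        return assign.mean(dim=1)  # (batch, num_codes) code-usage distribution

    def quantize(self, features: torch.Tensor):
        # Hard nearest-neighbor assignment with straight-through gradients (assumed here).
        idx = self._distances(features).argmin(dim=-1)
        quantized = self.codebook(idx)
        return features + (quantized - features).detach(), idx


def cross_modal_code_matching(p_video: torch.Tensor, p_audio: torch.Tensor) -> torch.Tensor:
    """Encourage paired modalities to use the shared codes similarly.

    A symmetric KL divergence between the two code distributions is one possible
    realization of the matching objective; it is an assumption of this sketch.
    """
    eps = 1e-8
    kl_1 = F.kl_div((p_video + eps).log(), p_audio, reduction="batchmean")
    kl_2 = F.kl_div((p_audio + eps).log(), p_video, reduction="batchmean")
    return 0.5 * (kl_1 + kl_2)
```

In this sketch, minimizing `cross_modal_code_matching` on paired inputs pushes both modalities to activate the same codebook entries for the same underlying concept, which is what enables the cluster-level cross-modal correspondence described above; the hyperparameters shown are placeholders.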