Video-and-language pre-training has shown promising results for learning generalizable representations. Most existing approaches model video and text implicitly, without considering explicit structural representations of the multi-modal content. We denote this form of representation as structural knowledge, which expresses rich semantics at multiple granularities. Related works have proposed object-aware approaches that inject similar knowledge as model inputs. However, existing methods usually fail to effectively exploit such knowledge as regularizations to shape a superior cross-modal representation space. To this end, we propose a Cross-modaL knOwledge-enhanced Pre-training (CLOP) method with Knowledge Regularizations. Our method has two key designs: 1) a simple yet effective Structural Knowledge Prediction (SKP) task that pulls together the latent representations of similar videos; and 2) a novel Knowledge-guided sampling approach for Contrastive Learning (KCL) that pushes apart cross-modal hard negative samples. We evaluate our method on four text-video retrieval tasks and one multi-choice QA task. The experiments show clear improvements, outperforming prior works by a substantial margin. In addition, we provide ablations and insights into how our methods affect the latent representation space, demonstrating the value of incorporating knowledge regularizations into video-and-language pre-training.
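The abstract does not spell out the KCL objective; as a rough illustration of the idea of pushing apart knowledge-similar (hard) cross-modal negatives, a minimal PyTorch sketch is given below. The function name, the multiplicative negative-weighting scheme, and the `knowledge_sim` input are our assumptions for illustration, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def knowledge_guided_contrastive_loss(video_emb, text_emb, knowledge_sim,
                                      temperature=0.05, hard_neg_weight=1.0):
    """InfoNCE-style video-text contrastive loss where negatives whose
    structural knowledge overlaps more with the anchor are up-weighted
    as hard negatives (a sketch of the KCL idea, not the paper's loss).

    video_emb, text_emb: (B, D) paired embeddings (row i is a positive pair).
    knowledge_sim: (B, B) symmetric similarity of structural knowledge in [0, 1].
    """
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = video_emb @ text_emb.t() / temperature  # (B, B) similarity logits

    # Up-weight off-diagonal (negative) pairs by their knowledge similarity;
    # the diagonal (positive) pairs keep weight 1.
    B = logits.size(0)
    neg_mask = ~torch.eye(B, dtype=torch.bool, device=logits.device)
    weights = torch.ones_like(logits)
    weights[neg_mask] += hard_neg_weight * knowledge_sim[neg_mask]

    # Adding log-weights scales each negative's term in the softmax
    # denominator: exp(logit + log w) = w * exp(logit).
    weighted_logits = logits + weights.log()

    targets = torch.arange(B, device=logits.device)
    loss_v2t = F.cross_entropy(weighted_logits, targets)
    loss_t2v = F.cross_entropy(weighted_logits.t(), targets)
    return 0.5 * (loss_v2t + loss_t2v)

# Usage with random stand-in data (knowledge_sim would come from, e.g.,
# overlap of extracted structural knowledge between samples):
B, D = 8, 256
v, t = torch.randn(B, D), torch.randn(B, D)
k = torch.rand(B, B)
k = 0.5 * (k + k.t())
k.fill_diagonal_(0)
loss = knowledge_guided_contrastive_loss(v, t, k)
```

Weighting negatives inside the softmax is one standard way to emphasize hard negatives; because `knowledge_sim` is symmetric, the same weights apply to both retrieval directions.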