Most video-and-language representation learning approaches employ contrastive learning, e.g., CLIP, to project video and text features into a common latent space according to the semantic similarities of text-video pairs. However, such learned shared latent spaces are often not optimal, and the modality gap between visual and textual representations cannot be fully eliminated. In this paper, we propose Expectation-Maximization Contrastive Learning (EMCL) to learn compact video-and-language representations. Specifically, we use the Expectation-Maximization algorithm to find a compact set of bases for the latent space, where the features can be concisely represented as linear combinations of these bases. Such feature decomposition of video-and-language representations reduces the rank of the latent space, resulting in increased representational power for the semantics. Extensive experiments on three benchmark text-video retrieval datasets demonstrate that our EMCL learns more discriminative video-and-language representations than previous methods, and significantly outperforms previous state-of-the-art methods across all metrics. More encouragingly, the proposed method can be applied to boost the performance of existing approaches either as a jointly trained layer or as an out-of-the-box inference module with no extra training, making it easy to incorporate into existing methods.
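To make the decomposition concrete, the sketch below illustrates one possible EM iteration that alternates between soft assignments of features to a small set of bases (E-step) and responsibility-weighted basis updates (M-step), then re-expresses the features as linear combinations of the converged bases. This is a minimal PyTorch-style illustration under assumed choices; the function name, initialization, and hyper-parameters (`num_bases`, `num_iters`, `temperature`) are illustrative, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def em_decompose(features, num_bases=32, num_iters=5, temperature=0.05):
    """Re-express features (N x d) as linear combinations of a compact set of bases.

    Illustrative sketch of EM-style feature decomposition; all names and
    hyper-parameters are assumptions, not the paper's exact implementation.
    """
    n, d = features.shape
    # Initialize the bases, e.g., from a random subset of the features.
    idx = torch.randperm(n)[:num_bases]
    bases = F.normalize(features[idx], dim=-1)            # (K, d)

    for _ in range(num_iters):
        # E-step: soft assignment (responsibility) of each feature to each basis.
        resp = (features @ bases.t() / temperature).softmax(dim=-1)   # (N, K)
        # M-step: update each basis as the responsibility-weighted mean of features.
        bases = resp.t() @ features                        # (K, d)
        bases = bases / (resp.sum(dim=0, keepdim=True).t() + 1e-6)
        bases = F.normalize(bases, dim=-1)

    # Final E-step, then reconstruct features over the compact (low-rank) bases.
    resp = (features @ bases.t() / temperature).softmax(dim=-1)
    reconstructed = resp @ bases                           # (N, d)
    return reconstructed, bases
```

In a text-video retrieval setting, one could apply such a decomposition to the pooled video and text embeddings of a batch before computing the contrastive similarity matrix, so that both modalities are expressed over the same compact set of bases.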