Video-text retrieval is a cross-modal representation learning problem in which, given a text query and a pool of candidate videos, the goal is to select the video that corresponds to the query. The contrastive paradigm of vision-language pretraining has shown promising success with large-scale datasets and unified transformer architectures, demonstrating the power of a joint latent space. Nevertheless, the intrinsic divergence between the visual and textual domains is still far from eliminated, and projecting different modalities into a joint latent space may distort the information contained within each individual modality. To overcome this issue, we present a novel mechanism for learning the translation relationship from a source modality space $\mathcal{S}$ to a target modality space $\mathcal{T}$ without requiring a joint latent space, which bridges the gap between the visual and textual domains. Furthermore, to preserve cycle consistency between translations, we adopt a cycle loss involving both the forward translation from $\mathcal{S}$ to the predicted target space $\mathcal{T'}$ and the backward translation from $\mathcal{T'}$ back to $\mathcal{S}$. Extensive experiments conducted on the MSR-VTT, MSVD, and DiDeMo datasets demonstrate the superiority and effectiveness of our LaT approach compared with vanilla state-of-the-art methods.
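As an illustration of the cycle objective described above (our notation and distance choice, not necessarily the paper's exact formulation), let $f:\mathcal{S}\rightarrow\mathcal{T'}$ denote the forward translator and $g:\mathcal{T'}\rightarrow\mathcal{S}$ the backward translator; a cycle loss of this kind can be sketched as
\[
\mathcal{L}_{\mathrm{cyc}} \;=\; \mathbb{E}_{s \sim \mathcal{S}}\big[\, d\big(g(f(s)),\, s\big) \,\big],
\]
where $d(\cdot,\cdot)$ is a generic distance (e.g., squared $\ell_2$), so that translating a source embedding forward and then back should recover the original embedding.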