This paper addresses the task of text-to-video retrieval: given a query in the form of a natural-language sentence, retrieve the videos that are semantically relevant to the query from a large collection of unlabeled videos. The success of this task depends on cross-modal representation learning that projects both videos and sentences into a common space for semantic similarity computation. In this work, we concentrate on video representation learning, an essential component for text-to-video retrieval. Inspired by the reading strategy of humans, we propose Reading-strategy Inspired Visual Representation Learning (RIVRL) to represent videos. RIVRL consists of two branches: a previewing branch and an intensive-reading branch. The previewing branch briefly captures an overview of the video, while the intensive-reading branch extracts more in-depth information. Moreover, the intensive-reading branch is made aware of the video overview captured by the previewing branch; we find such holistic information helps the intensive-reading branch extract more fine-grained features. Extensive experiments on three datasets show that RIVRL achieves a new state-of-the-art on TGIF and VATEX. Moreover, on MSR-VTT, our model using two video features performs comparably to the state-of-the-art that uses seven video features, and even outperforms models pre-trained on the large-scale HowTo100M dataset.
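For illustration, below is a minimal sketch of the two-branch design described above, written in PyTorch. All module choices here (GRU encoders, conditioning the intensive-reading branch by concatenating the overview vector to each frame feature, the fusion strategy) are our own simplifying assumptions for exposition, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class RIVRLSketch(nn.Module):
    """Hypothetical sketch of a two-branch video encoder in the spirit of RIVRL.

    The previewing branch summarizes frame features into a single overview
    vector; the intensive-reading branch re-reads the frames conditioned on
    that overview to extract fine-grained features. Names and dimensions are
    illustrative assumptions, not the authors' implementation.
    """

    def __init__(self, frame_dim=2048, hidden_dim=512, embed_dim=1024):
        super().__init__()
        # Previewing branch: a lightweight pass over the frame features.
        self.preview_rnn = nn.GRU(frame_dim, hidden_dim, batch_first=True)
        self.preview_fc = nn.Linear(hidden_dim, embed_dim)
        # Intensive-reading branch: takes each frame feature concatenated
        # with the overview vector, so it is "aware" of the video overview.
        self.intensive_rnn = nn.GRU(frame_dim + hidden_dim, hidden_dim,
                                    batch_first=True, bidirectional=True)
        self.intensive_fc = nn.Linear(2 * hidden_dim, embed_dim)

    def forward(self, frames):                       # frames: (B, T, frame_dim)
        # Previewing branch: use the last hidden state as the video overview.
        _, h = self.preview_rnn(frames)              # h: (1, B, hidden_dim)
        overview = h.squeeze(0)                      # (B, hidden_dim)
        preview_emb = self.preview_fc(overview)      # (B, embed_dim)

        # Intensive-reading branch: broadcast the overview to every time
        # step, then re-read the frames with a bidirectional GRU.
        T = frames.size(1)
        cond = overview.unsqueeze(1).expand(-1, T, -1)
        out, _ = self.intensive_rnn(torch.cat([frames, cond], dim=-1))
        intensive_emb = self.intensive_fc(out.mean(dim=1))  # (B, embed_dim)

        # Both embeddings live in the common space and can be fused and
        # matched against a sentence embedding for similarity computation.
        return preview_emb, intensive_emb
```

Conditioning by concatenation is only one simple way to inject the overview into the intensive-reading branch; the paper's actual mechanism for sharing holistic information between the branches may differ.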