Video-text pre-training aims to learn transferable representations from large-scale video-text pairs by aligning the semantics of visual and textual information. State-of-the-art approaches extract visual features from raw pixels in an end-to-end fashion. However, these methods operate directly at the frame level and thus overlook the spatio-temporal structure of objects in video, even though this structure has a strong synergy with nouns in textual descriptions. In this work, we propose a simple yet effective module for video-text representation learning, namely RegionLearner, which takes the structure of objects into account during pre-training on large-scale video-text pairs. Given a video, our module (1) first quantizes visual features into semantic clusters, then (2) generates learnable masks and uses them to aggregate features belonging to the same semantic region, and finally (3) models the interactions between different aggregated regions. In contrast to approaches that rely on off-the-shelf object detectors, our module requires no explicit supervision and is much more computationally efficient. We pre-train the proposed approach on the public WebVid2M and CC3M datasets. Extensive evaluations on four downstream video-text retrieval benchmarks clearly demonstrate the effectiveness of our RegionLearner. The code will be available at https://github.com/ruiyan1995/Region_Learner.
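
To make the three-step pipeline concrete, below is a minimal PyTorch sketch of the idea described above: quantize token features into semantic clusters, aggregate them into regions via learnable soft masks, and let the regions interact. All names and hyperparameters here (e.g. `RegionLearnerSketch`, `num_codes`, `num_regions`) are illustrative assumptions and do not reflect the authors' actual implementation; see the linked repository for that.

```python
# Illustrative sketch only; not the authors' implementation.
import torch
import torch.nn as nn


class RegionLearnerSketch(nn.Module):
    def __init__(self, dim=256, num_codes=1024, num_regions=8, num_heads=4):
        super().__init__()
        # (1) Codebook for quantizing token features into semantic clusters.
        self.codebook = nn.Embedding(num_codes, dim)
        # (2) Projection that predicts soft, learnable region masks per token.
        self.mask_proj = nn.Linear(dim, num_regions)
        # (3) Self-attention to model interactions between aggregated regions.
        self.region_attn = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True
        )

    def forward(self, x):
        # x: (batch, num_tokens, dim) patch/frame features from a video encoder.
        # (1) Quantize: snap each token to its nearest codebook entry.
        dists = torch.cdist(x, self.codebook.weight.unsqueeze(0))  # (B, T, num_codes)
        codes = dists.argmin(dim=-1)                               # (B, T)
        quantized = self.codebook(codes)                           # (B, T, dim)
        # Straight-through estimator so gradients flow back to x.
        quantized = x + (quantized - x).detach()

        # (2) Aggregate: soft masks group tokens into semantic regions,
        # so each region is a weighted average of its tokens.
        masks = self.mask_proj(quantized).softmax(dim=1)           # (B, T, num_regions)
        regions = torch.einsum('btr,btd->brd', masks, quantized)   # (B, R, dim)

        # (3) Interact: let regions exchange information.
        return self.region_attn(regions)                           # (B, R, dim)


# Usage: 16 videos, 49 tokens each, 256-dim features.
feats = torch.randn(16, 49, 256)
out = RegionLearnerSketch()(feats)
print(out.shape)  # torch.Size([16, 8, 256])
```

The resulting region features can then be aligned with the text representation in place of, or alongside, the raw frame-level features.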