Leveraging large volumes of web videos paired with search queries or surrounding texts (e.g., titles) offers an economical and scalable alternative to supervised video representation learning. Nevertheless, modeling such weak visual-textual connections is non-trivial due to query polysemy (i.e., many possible meanings for a query) and text isomorphism (i.e., the same syntactic structure shared by different texts). In this paper, we introduce a new design of mutual calibration between query and text to boost weakly-supervised video representation learning. Specifically, we present Bi-Calibration Networks (BCN), a novel architecture that couples two calibrations to learn the amendment from text to query and vice versa. Technically, BCN executes clustering on all the titles of the videos retrieved by an identical query and takes the centroid of each cluster as a text prototype. The query vocabulary is built directly on query words. The video-to-text and video-to-query projections over the text prototypes and query vocabulary then trigger the text-to-query and query-to-text calibrations, respectively, to estimate the amendment to the query or text. We also devise a selection scheme to balance the two corrections. Two large-scale web video datasets, each pairing every video with a query and a title, are newly collected for weakly-supervised video representation learning, named YOVO-3M and YOVO-10M, respectively. The video features of BCN learnt on 3M web videos obtain superior results under the linear model protocol on downstream tasks. More remarkably, BCN trained on the larger set of 10M web videos with further fine-tuning leads to 1.6% and 1.8% gains in top-1 accuracy on the Kinetics-400 and Something-Something V2 datasets over the state-of-the-art TDN and ACTION-Net methods with ImageNet pre-training. Source code and datasets are available at \url{https://github.com/FuchenUSTC/BCN}.
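The text-prototype step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the title embeddings are random placeholders, the cluster count `k` is an assumed hyper-parameter, and a plain k-means (implemented here in NumPy) stands in for whatever clustering BCN actually uses.

```python
import numpy as np

def text_prototypes(title_embeddings, k, iters=20, seed=0):
    """Cluster the title embeddings of one query with k-means and
    return the k centroids, used as that query's text prototypes."""
    rng = np.random.default_rng(seed)
    # initialize centroids with k randomly chosen titles
    centroids = title_embeddings[rng.choice(len(title_embeddings), k, replace=False)]
    for _ in range(iters):
        # assign each title to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(title_embeddings[:, None] - centroids[None], axis=-1)
        labels = dists.argmin(axis=1)
        # recompute centroids; keep the old one if a cluster goes empty
        centroids = np.stack([
            title_embeddings[labels == j].mean(axis=0) if (labels == j).any() else centroids[j]
            for j in range(k)
        ])
    return centroids

# toy example: 100 title embeddings of dimension 16 for a single query
emb = np.random.default_rng(1).normal(size=(100, 16))
protos = text_prototypes(emb, k=5)
print(protos.shape)  # (5, 16)
```

The video-to-text projection is then computed over these prototypes (rather than over every raw title), which is what makes the subsequent text-to-query calibration tractable.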