Building a universal video-language model for solving various video understanding tasks (e.g., text-video retrieval, video question answering) is an open challenge in the machine learning field. Towards this goal, most recent attempts train models, usually consisting of uni-modal and cross-modal feature encoders, with supervised or pair-wise contrastive pretext tasks. Though offering attractive generality, the resulting models have to compromise between efficiency and performance. We argue these flaws are caused by their pre-training strategies\textemdash they cannot align and fuse features from different modalities well simultaneously. We then introduce Clover -- a Correlated Video-Language pre-training method -- towards a universal video-language model for solving multiple video understanding tasks with neither performance nor efficiency compromise. It improves cross-modal feature alignment and fusion via a novel tri-modal alignment pre-training task. Additionally, we propose to enhance the tri-modal alignment by incorporating learning from masked samples and a novel pair-wise ranking loss. Clover establishes new state-of-the-art results on multiple downstream tasks, including three retrieval tasks under both zero-shot and fine-tuning settings, and eight video question answering tasks. Code and pre-trained models will be released at https://github.com/LeeYN-43/Clover.
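To make the tri-modal alignment idea concrete, the following is a minimal sketch, not the paper's actual implementation: it assumes three L2-normalized embedding sets (video, text, and a fused video-text representation) and applies a symmetric InfoNCE contrastive loss to each of the three pairings, so that all three modalities are pulled toward a shared space. The function names and the equal averaging of the three terms are illustrative assumptions.

```python
import numpy as np

def l2norm(x):
    """Normalize embeddings to unit length along the feature axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def info_nce(sim, tau=0.07):
    """Symmetric InfoNCE over a similarity matrix whose diagonal holds
    the matched (positive) pairs; off-diagonal entries are negatives."""
    logits = sim / tau
    # log-softmax along rows (e.g. video -> text) and columns (text -> video)
    log_p_rows = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    log_p_cols = logits - np.log(np.exp(logits).sum(axis=0, keepdims=True))
    n = sim.shape[0]
    return -(np.trace(log_p_rows) + np.trace(log_p_cols)) / (2 * n)

def tri_modal_alignment_loss(video, text, fused, tau=0.07):
    """Hypothetical tri-modal alignment: contrast every pair of the
    three modalities (video-text, video-fused, text-fused)."""
    v, t, f = l2norm(video), l2norm(text), l2norm(fused)
    return (info_nce(v @ t.T, tau)
            + info_nce(v @ f.T, tau)
            + info_nce(t @ f.T, tau)) / 3.0
```

The paper's additional components (learning from masked samples and the pair-wise ranking loss) would add further terms on top of this alignment objective; they are omitted here for brevity.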