Building a universal video-language model for solving various video understanding tasks (e.g., text-video retrieval, video question answering) is an open challenge in the machine learning field. Towards this goal, most recent attempts train models, usually consisting of uni-modal and cross-modal feature encoders, with supervised or pair-wise contrastive pretext tasks. Though offering attractive generality, the resulting models have to compromise between efficiency and performance. We argue these flaws are caused by their pre-training strategies\textemdash they cannot align and fuse features from different modalities well simultaneously. We then introduce Clover -- a Correlated Video-Language pre-training method -- towards a universal video-language model for solving multiple video understanding tasks with neither performance nor efficiency compromise. It improves cross-modal feature alignment and fusion via a novel tri-modal alignment pre-training task. Additionally, we propose to enhance the tri-modal alignment by incorporating learning from masked samples and a novel pair-wise ranking loss. Clover establishes new state-of-the-art results on multiple downstream tasks, including three retrieval tasks under both zero-shot and fine-tuning settings, and eight video question answering tasks. Code and pre-trained models will be released at https://github.com/LeeYN-43/Clover.
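To make the tri-modal alignment idea concrete, the following is a minimal sketch, not the paper's actual implementation: it assumes three L2-normalized embedding sets (video, text, and a fused video-text representation) and applies a symmetric InfoNCE contrastive loss to each of the three pairings, so that all three modalities are pulled toward a shared space. The function names and the equal averaging of the three terms are illustrative assumptions.

```python
import numpy as np

def l2norm(x):
    """Normalize embeddings to unit length along the feature axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def info_nce(sim, tau=0.07):
    """Symmetric InfoNCE over a similarity matrix whose diagonal holds
    the matched (positive) pairs; off-diagonal entries are negatives."""
    logits = sim / tau
    # log-softmax along rows (e.g. video -> text) and columns (text -> video)
    log_p_rows = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    log_p_cols = logits - np.log(np.exp(logits).sum(axis=0, keepdims=True))
    n = sim.shape[0]
    return -(np.trace(log_p_rows) + np.trace(log_p_cols)) / (2 * n)

def tri_modal_alignment_loss(video, text, fused, tau=0.07):
    """Hypothetical tri-modal alignment: contrast every pair of the
    three modalities (video-text, video-fused, text-fused)."""
    v, t, f = l2norm(video), l2norm(text), l2norm(fused)
    return (info_nce(v @ t.T, tau)
            + info_nce(v @ f.T, tau)
            + info_nce(t @ f.T, tau)) / 3.0
```

The paper's additional components (learning from masked samples and the pair-wise ranking loss) would add further terms on top of this alignment objective; they are omitted here for brevity.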