Clover: Towards A Unified Video-Language Alignment and Fusion Model

Building a universal Video-Language model for solving various video understanding tasks (e.g., text-video retrieval, video question answering) remains an open challenge in machine learning. Towards this goal, most recent works build the model by stacking uni-modal and cross-modal feature encoders and train it with pair-wise contrastive pre-text tasks. Though offering attractive generality, the resulting models have to compromise between efficiency and performance: they mostly adopt different architectures to deal with different downstream tasks. We find this is because pair-wise training cannot adequately align and fuse features from different modalities. We therefore introduce Clover, a Correlated Video-Language pre-training method, as a step towards a universal Video-Language model that solves multiple video understanding tasks without compromising either performance or efficiency. It improves cross-modal feature alignment and fusion via a novel tri-modal alignment pre-training task. Additionally, we propose to enhance the tri-modal alignment by incorporating learning from semantic masked samples and a new pair-wise ranking loss. Clover establishes new state-of-the-art results on multiple downstream tasks, including three retrieval tasks in both zero-shot and fine-tuning settings, and eight video question answering tasks. Code and pre-trained models will be released at https://github.com/LeeYN-43/Clover.
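For readers unfamiliar with the pair-wise contrastive pre-text task that the abstract contrasts Clover against, one common form is a symmetric InfoNCE-style loss over a batch of matched video and text embeddings. The following is a minimal plain-Python sketch of that generic loss only. The function names, the toy embeddings, and the temperature of 0.07 are illustrative assumptions, not taken from the paper or its repository, and the sketch does not implement Clover's tri-modal alignment task or its ranking loss.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def pairwise_contrastive_loss(video_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE-style loss (illustrative, not Clover's objective).

    video_embs[i] and text_embs[i] are assumed to be a matched pair;
    every other combination in the batch is treated as a negative.
    """
    v = [l2_normalize(e) for e in video_embs]
    t = [l2_normalize(e) for e in text_embs]
    # Cosine similarity of every video with every text, scaled by temperature.
    sim = [[sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t]
           for vi in v]
    sim_t = [list(col) for col in zip(*sim)]  # text-to-video direction
    return 0.5 * (cross_entropy_diag(sim) + cross_entropy_diag(sim_t))
```

With toy inputs, a batch whose video and text embeddings are correctly paired yields a much lower loss than the same batch with the texts shuffled, which is the signal this kind of objective trains on.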

