Building a universal Video-Language model for solving various video understanding tasks (\emph{e.g.}, text-video retrieval, video question answering) is an open challenge for the machine learning field. Towards this goal, most recent works build the model by stacking uni-modal and cross-modal feature encoders and train it with pair-wise contrastive pretext tasks. Though they offer attractive generality, the resulting models have to compromise between efficiency and performance, and they mostly adopt different architectures for different downstream tasks. We find this is because pair-wise training cannot \emph{align} and \emph{fuse} features from different modalities well. We therefore introduce \textbf{Clover}, a Correlated Video-Language pre-training method, towards a universal Video-Language model that solves multiple video understanding tasks without compromising either performance or efficiency. It improves cross-modal feature alignment and fusion via a novel tri-modal alignment pre-training task. Additionally, we propose to enhance the tri-modal alignment by incorporating learning from semantic masked samples and a new pair-wise ranking loss. Clover establishes new state-of-the-art results on multiple downstream tasks, including three retrieval tasks in both zero-shot and fine-tuning settings, and eight video question answering tasks. Code and pre-trained models will be released at \url{https://github.com/LeeYN-43/Clover}.