The recent success of Large Language Models (LLMs) has prompted their extension to the multimodal domain, first producing image-text Multimodal LLMs (MLLMs) and subsequently video-text models. In this work, we investigate the challenge of contextual and temporal comprehension in video-language models by exploring the task of temporal localization in videos. To address this problem, prior works have developed complex task-specific architectures, introduced novel modules to embed time into MLLMs, or leveraged additional input signals such as video transcripts to better encode contextual and temporal information. We find that most of these efforts are surpassed by a much simpler design. We introduce Chrono, a universal sequence blueprint that can be applied to any image-text pretrained MLLM. In extensive experiments spanning different MLLM architectures and sizes, as well as fine-tuning and zero-shot settings, we demonstrate new state-of-the-art results in moment retrieval on the widely used benchmarks Charades-STA, QVHighlights, and ActivityNet Captions, and in grounded video question answering on NExT-GQA.