MT-Video-Bench: A Holistic Video Understanding Benchmark for Evaluating Multimodal LLMs in Multi-Turn Dialogues

Yaning Pan, Zekun Wang, Qianqian Xie, Yongqian Wen, Yuanxing Zhang, Guohui Zhang, Haoxuan Hu, Zhiyu Pan, Yibing Huang, Zhidong Gan, Yonghong Lin, An Ping, Tianhao Peng, Jiaheng Liu

Source: arXiv. Project website: https://github.com/NJU-LINK/MT-Video-Bench

The recent development of Multimodal Large Language Models (MLLMs) has significantly advanced AI's ability to understand visual modalities. However, existing evaluation benchmarks remain limited to single-turn question answering, overlooking the complexity of multi-turn dialogues in real-world scenarios. To bridge this gap, we introduce MT-Video-Bench, a holistic video understanding benchmark for evaluating MLLMs in multi-turn dialogues. Specifically, our MT-Video-Bench mainly assesses six core competencies that focus on perceptivity and interactivity, encompassing 987 meticulously curated multi-turn dialogues from diverse domains. These capabilities are rigorously aligned with real-world applications, such as interactive sports analysis and multi-turn video-based intelligent tutoring. With MT-Video-Bench, we extensively evaluate various state-of-the-art open-source and closed-source MLLMs, revealing their significant performance discrepancies and limitations in handling multi-turn video dialogues. The benchmark will be publicly available to foster future research.
