The recent development of Multimodal Large Language Models (MLLMs) has significantly advanced AI's ability to understand visual modalities. However, existing evaluation benchmarks remain limited to single-turn question answering, overlooking the complexity of multi-turn dialogues in real-world scenarios. To bridge this gap, we introduce MT-Video-Bench, a holistic video understanding benchmark for evaluating MLLMs in multi-turn dialogues. Specifically, MT-Video-Bench assesses six core competencies centered on perceptivity and interactivity, covering 987 meticulously curated multi-turn dialogues drawn from diverse domains. These competencies align closely with real-world applications, such as interactive sports analysis and multi-turn video-based intelligent tutoring. With MT-Video-Bench, we extensively evaluate state-of-the-art open-source and closed-source MLLMs, revealing significant performance discrepancies and limitations in handling multi-turn video dialogues. The benchmark will be publicly released to foster future research.