As multimodal large language models (MLLMs) continue to advance on challenging tasks, a key question emerges: what essential capabilities are still missing? A critical aspect of human learning is continuous interaction with the environment, involving not only language but also multimodal understanding and generation. To move closer to human-level intelligence, models must likewise support multi-turn, multimodal interaction; in particular, they should comprehend interleaved multimodal contexts and respond coherently in ongoing exchanges. In this work, we present an initial exploration through InterMT, the first preference dataset for multi-turn multimodal interaction grounded in real human feedback. Because current MLLMs lack such complex interactive capabilities, we place particular emphasis on human oversight, introducing expert annotations to guide the construction process. InterMT captures human preferences at both global and local levels across nine sub-dimensions, and consists of 15.6k prompts, 52.6k multi-turn dialogue instances, and 32.4k human-labeled preference pairs. To compensate for current models' limited capability in multimodal understanding and generation, we introduce an agentic workflow that leverages tool-augmented MLLMs to construct multi-turn QA instances. To further this goal, we also introduce InterMT-Bench, which assesses the ability of MLLMs to assist judges on multi-turn, multimodal tasks. We demonstrate the utility of InterMT through applications such as judge moderation, and further reveal the multi-turn scaling law of judge models. We hope that open-sourcing our data will facilitate further research on aligning current MLLMs toward the next step. Our project website can be found at https://pku-intermt.github.io .