Agentic multimodal models should not only comprehend text and images, but also actively invoke external tools, such as code execution environments and web search, and integrate these operations into reasoning. In this work, we introduce DeepEyesV2 and explore how to build an agentic multimodal model from the perspectives of data construction, training methods, and model evaluation. We observe that direct reinforcement learning alone fails to induce robust tool-use behavior. This observation motivates a two-stage training pipeline: a cold-start stage to establish tool-use patterns, and a reinforcement learning stage to further refine tool invocation. We curate a diverse, moderately challenging training dataset, specifically including examples where tool use is genuinely beneficial. We further introduce RealX-Bench, a comprehensive benchmark designed to evaluate real-world multimodal reasoning, which inherently requires the integration of multiple capabilities, including perception, search, and reasoning. We evaluate DeepEyesV2 on RealX-Bench and other representative benchmarks, demonstrating its effectiveness across real-world understanding, mathematical reasoning, and search-intensive tasks. Moreover, DeepEyesV2 exhibits task-adaptive tool invocation, tending to use image operations for perception tasks and numerical computations for reasoning tasks. Reinforcement learning further enables complex tool combinations and allows the model to selectively invoke tools based on context. We hope our study can provide guidance for the community in developing agentic multimodal models.
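To make the tool-interleaved reasoning loop described above concrete, here is a minimal Python sketch. It is not DeepEyesV2's actual interface: the `<tool:code>` tag format, the scripted `generate` stand-in, and the `run_python` executor are illustrative assumptions; a real system would call the multimodal model and a sandboxed executor at the marked points.

```python
import re

# Scripted stand-in for the model: the first turn requests the code tool,
# the second turn answers. A real system would query the multimodal model here.
_SCRIPT = iter([
    "Let me compute this. <tool:code>print(2**10)</tool>",
    "The answer is 1024.",
])

def generate(messages):
    """Hypothetical model call; returns the next assistant turn."""
    return next(_SCRIPT)

def run_python(code):
    """Toy code executor; a real deployment would sandbox this call."""
    import io, contextlib
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, {})
    return buf.getvalue().strip()

TOOL_PATTERN = re.compile(r"<tool:code>(.*?)</tool>", re.S)

def agentic_loop(question, max_turns=8):
    """Alternate model turns with tool executions until a plain-text answer."""
    messages = [{"role": "user", "content": question}]
    for _ in range(max_turns):
        turn = generate(messages)
        messages.append({"role": "assistant", "content": turn})
        match = TOOL_PATTERN.search(turn)
        if match is None:  # no tool call: treat this turn as the final answer
            return turn
        result = run_python(match.group(1))
        # Feed the tool output back so the next turn can reason over it.
        messages.append({"role": "tool", "content": result})
    return messages[-1]["content"]

print(agentic_loop("What is 2 to the 10th power?"))  # -> "The answer is 1024."
```

In the training pipeline the abstract describes, the cold-start stage would establish the habit of emitting such tool calls in a parseable format, while reinforcement learning would refine when and which tools to invoke.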