Recent VLM-based agents aim to replicate OpenAI O3's ``thinking with images'' via tool use, but most open-source methods restrict the input to a single image, falling short on real-world multi-image QA tasks. To address this, we propose IMAgent, an open-source vision agent trained via end-to-end reinforcement learning and dedicated to complex multi-image tasks. Leveraging a multi-agent system, we generate challenging, visually rich multi-image QA pairs that fully activate the tool-use potential of the base VLM. After manual verification, we obtain MIFG-QA, comprising 10k samples for training and evaluation. As reasoning chains grow deeper, VLMs tend to increasingly ignore their visual inputs. We therefore develop two specialized tools for visual reflection and confirmation, allowing the model to proactively redirect its attention to image content during inference. Benefiting from our action-trajectory two-level mask strategy, IMAgent achieves stable tool-use behavior through pure RL training, without requiring costly supervised fine-tuning data. Extensive experiments demonstrate that IMAgent maintains strong performance on existing single-image benchmarks while achieving substantial improvements on our proposed multi-image dataset, and our analysis provides actionable insights for the research community. Code and data will be released soon.
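The abstract refers to an action-trajectory two-level mask strategy for pure RL training. As a rough illustration only, the sketch below shows one plausible reading of such a scheme applied to a token-level policy-gradient loss: an action-level mask drops tool-returned (observation) tokens from the loss, and a trajectory-level mask drops rollouts with invalid tool calls. All names and the exact masking semantics here are assumptions, not the paper's released implementation.

```python
# Hypothetical sketch, NOT the authors' code: one reading of a two-level
# (action + trajectory) mask for a token-level policy-gradient loss.
import torch


def masked_pg_loss(
    logprobs: torch.Tensor,     # [B, T] log-probs of sampled tokens
    advantages: torch.Tensor,   # [B, T] per-token advantages
    action_mask: torch.Tensor,  # [B, T] 1 = model-generated token, 0 = tool output (assumed)
    traj_mask: torch.Tensor,    # [B]    1 = valid trajectory, 0 = discarded rollout (assumed)
) -> torch.Tensor:
    # Combine the two levels: a token contributes only if it is model-generated
    # AND belongs to a valid trajectory.
    mask = action_mask * traj_mask.unsqueeze(1)
    loss = -(logprobs * advantages * mask).sum()
    # Normalize by the number of tokens actually kept in the loss.
    return loss / mask.sum().clamp(min=1)
```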