Visual reasoning requires models to interpret images and videos and respond to implicit text queries across diverse output formats, from pixel-level segmentation masks to natural language descriptions. Existing approaches rely on supervised fine-tuning with task-specific architectures: reasoning segmentation, grounding, summarization, and visual question answering each demand distinct model designs and training, which prevents unified solutions and limits cross-task and cross-modality generalization. Hence, we propose DT-R1, a reinforcement learning framework that trains large language models to construct digital twin representations of complex multi-modal visual inputs and then reason over these high-level representations, yielding a unified approach to visual reasoning. Specifically, we train DT-R1 with GRPO using a novel reward that validates both the structural integrity of the digital twin and the accuracy of the final output. Evaluations on six visual reasoning benchmarks, covering two modalities and four task types, show that DT-R1 consistently improves over state-of-the-art task-specific models. DT-R1 opens a new direction in which visual reasoning emerges from reinforcement learning over digital twin representations.
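To make the training signal concrete, the sketch below illustrates one plausible form of a composite GRPO reward of the kind the abstract describes: one term checks that the rollout contains a well-formed digital twin representation, another scores the final answer, and per-rollout rewards are normalized group-relatively as in GRPO. The `<twin>`/`<answer>` tag names, the JSON twin format, the exact-match accuracy check, and the 0.3/0.7 weighting are all illustrative assumptions, not the paper's specification.

```python
import json
import re


def dt_reward(completion: str, ground_truth: str) -> float:
    """Hypothetical composite reward for a DT-R1-style rollout:
    structural integrity of the digital twin plus output accuracy."""
    # (1) Structural integrity: the rollout must contain a parseable
    # digital twin block, assumed here to be JSON inside <twin> tags.
    structure = 0.0
    twin = re.search(r"<twin>(.*?)</twin>", completion, re.DOTALL)
    if twin is not None:
        try:
            json.loads(twin.group(1))
            structure = 1.0
        except json.JSONDecodeError:
            structure = 0.0

    # (2) Output accuracy: exact match on an assumed <answer> block.
    ans = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    accuracy = 1.0 if ans and ans.group(1).strip() == ground_truth.strip() else 0.0

    # Weighted sum; the 0.3/0.7 split is an arbitrary illustration.
    return 0.3 * structure + 0.7 * accuracy


def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages as in GRPO: normalize each rollout's
    reward by the mean and standard deviation of its sampled group."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + 1e-6) for r in rewards]
```

Under this sketch, a group of sampled completions for the same query would each be scored with `dt_reward`, and `grpo_advantages` would convert those scores into the relative advantages used in the GRPO policy update.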