Transductive Visual Programming: Evolving Tool Libraries from Experience for Spatial Reasoning

Spatial reasoning in 3D scenes requires precise geometric calculations that challenge vision-language models. Visual programming addresses this by decomposing problems into steps that call specialized tools, yet existing methods rely on either fixed toolsets or speculative tool induction before solving problems, resulting in suboptimal programs and poor utilization of induced tools. We present Transductive Visual Programming (TVP), a novel framework that builds new tools from its own experience rather than speculation. TVP first solves problems using basic tools while accumulating experiential solutions into an Example Library, then abstracts recurring patterns from these programs into reusable higher-level tools for an evolving Tool Library. This allows TVP to tackle new problems with increasingly powerful tools learned from experience. On Omni3D-Bench, TVP achieves state-of-the-art performance, outperforming GPT-4o by 22% and the previous best visual programming system by 11%. Our transductively learned tools are used 5x more frequently as core program dependencies than inductively created ones, demonstrating more effective tool discovery and reuse. The evolved tools also generalize strongly to unseen spatial tasks, achieving superior performance on benchmarks from the SpatialScore-Hard collection without any test-set-specific modification. Our work establishes experience-driven transductive tool creation as a powerful paradigm for building self-evolving visual programming agents that effectively tackle challenging spatial reasoning tasks. We release our code at https://transductive-visualprogram.github.io/.
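
To make the experience-driven loop more concrete, the following toy sketch shows one way the abstraction step could look. It is not TVP's implementation, and the abstract does not specify how patterns are detected. Here we assume solved programs are stored as plain lists of basic-tool call names, and a "recurring pattern" is any contiguous pair of calls that appears in at least `min_support` stored programs. The names `ToolLibrary`, `abstract_tools`, `min_support`, and the tool names `detect`, `depth`, `distance`, `height`, and `count` are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ToolLibrary:
    """Hypothetical tool store: maps a tool name to the basic calls it wraps."""
    tools: dict = field(default_factory=dict)

    def add(self, steps):
        # Name the composite tool after the calls it bundles (illustrative only).
        name = "tool_" + "_".join(steps)
        self.tools[name] = steps
        return name


def abstract_tools(example_library, tool_library, length=2, min_support=2):
    """Promote call patterns that recur across solved programs into new tools."""
    counts = Counter()
    for program in example_library:
        # Collect each pattern once per program, so the count is the number
        # of distinct programs that contain the pattern.
        patterns = {
            tuple(program[i:i + length])
            for i in range(len(program) - length + 1)
        }
        counts.update(patterns)

    new_tools = []
    for steps, support in sorted(counts.items()):
        if support >= min_support:
            new_tools.append(tool_library.add(steps))
    return new_tools


# Assumed Example Library: programs already solved with basic tools only.
example_library = [
    ["detect", "depth", "distance"],
    ["detect", "depth", "height"],
    ["detect", "count"],
]

tool_library = ToolLibrary()
new_tools = abstract_tools(example_library, tool_library)
print(new_tools)
print(tool_library.tools)
```

In this sketch only the `detect` followed by `depth` pair occurs in two programs, so it is the only pattern promoted. The point of the design is that tools are created after programs exist, from patterns that were actually used, rather than proposed before any problem has been solved.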
