Symbolic world models (e.g., PDDL domains or executable simulators) are central to model-based planning, but training LLMs to generate such world models is limited by the lack of large-scale verifiable supervision. Current approaches rely primarily on static validation methods that fail to catch behavior-level errors arising from interactive execution. In this paper, we propose Agent2World, a tool-augmented multi-agent framework that achieves strong inference-time world-model generation and also serves as a data engine for supervised fine-tuning, by grounding generation in multi-agent feedback. Agent2World follows a three-stage pipeline: (i) a Deep Researcher agent performs knowledge synthesis via web search to address specification gaps; (ii) a Model Developer agent implements executable world models; and (iii) a specialized Testing Team conducts adaptive unit testing and simulation-based validation. Agent2World demonstrates superior inference-time performance across three benchmarks spanning both Planning Domain Definition Language (PDDL) and executable code representations, achieving consistent state-of-the-art results. Beyond inference, the Testing Team serves as an interactive environment for the Model Developer, providing behavior-aware adaptive feedback that yields multi-turn training trajectories. The model fine-tuned on these trajectories substantially improves world-model generation, with an average relative gain of 30.95% over the same model before training. Project page: https://agent2world.github.io.