Large Language Models (LLMs) often falter at complex planning tasks that require exploration and self-correction, because their linear reasoning process struggles to recover from early mistakes. Search algorithms such as Monte Carlo Tree Search (MCTS) can explore alternatives, but they are often ineffective when guided by sparse rewards and fail to leverage the rich semantic capabilities of LLMs. We introduce SPIRAL (Symbolic LLM Planning via Grounded and Reflective Search), a novel framework that embeds a cognitive architecture of three specialized LLM agents into an MCTS loop. SPIRAL's key contribution is its integrated planning pipeline, in which a Planner proposes creative next steps, a Simulator grounds the search by predicting realistic outcomes, and a Critic provides dense reward signals through reflection. This synergy transforms MCTS from a brute-force search into a guided, self-correcting reasoning process. On the DailyLifeAPIs and HuggingFace datasets, SPIRAL consistently outperforms the default Chain-of-Thought planning method and other state-of-the-art agents. For example, it achieves 83.6% overall accuracy on DailyLifeAPIs, an improvement of over 16 percentage points over the next-best search framework, while also demonstrating superior token efficiency. Our work demonstrates that structuring LLM reasoning as a guided, reflective, and grounded search process yields more robust and efficient autonomous planners. The source code, full appendices, and all experimental data are available for reproducibility at the official project repository.
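To make the Planner, Simulator, and Critic roles concrete, the following is a minimal sketch of how three such agents could plug into an MCTS loop. It is not SPIRAL's implementation: the three agents are plain Python stand-ins rather than LLM calls, and the toy task (reach `TARGET` by adding 1, 2, or 3), the function names, the UCB constant, and the depth limit are all assumptions chosen for illustration.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical toy task: starting from 0, reach TARGET by adding 1, 2, or 3.
TARGET = 5
ACTIONS = (1, 2, 3)


def planner(state: int) -> List[int]:
    """Stand-in for the Planner agent: propose candidate next actions."""
    return [a for a in ACTIONS if state + a <= TARGET + 2]


def simulator(state: int, action: int) -> int:
    """Stand-in for the Simulator agent: predict the outcome of an action."""
    return state + action


def critic(state: int) -> float:
    """Stand-in for the Critic agent: dense score in [0, 1], 1.0 at the goal."""
    return max(0.0, 1.0 - abs(TARGET - state) / TARGET)


@dataclass
class Node:
    state: int
    parent: Optional["Node"] = None
    action: Optional[int] = None
    children: List["Node"] = field(default_factory=list)
    untried: List[int] = field(default_factory=list)
    visits: int = 0
    value: float = 0.0


def ucb(child: Node, parent_visits: int, c: float = 1.4) -> float:
    """Upper confidence bound: mean Critic score plus an exploration bonus."""
    bonus = c * math.sqrt(math.log(parent_visits) / child.visits)
    return child.value / child.visits + bonus


def search(root_state: int = 0, iterations: int = 200, max_depth: int = 4) -> List[int]:
    root = Node(root_state, untried=planner(root_state))
    for _ in range(iterations):
        node, depth = root, 0
        # Selection: descend through fully expanded nodes by UCB.
        while not node.untried and node.children:
            node = max(node.children, key=lambda ch: ucb(ch, node.visits))
            depth += 1
        # Expansion: take a Planner proposal and let the Simulator predict its outcome.
        if node.untried and depth < max_depth and node.state != TARGET:
            action = node.untried.pop()
            nxt = simulator(node.state, action)
            child = Node(nxt, parent=node, action=action, untried=planner(nxt))
            node.children.append(child)
            node = child
        # Evaluation: the Critic's dense score replaces a random rollout.
        reward = critic(node.state)
        # Backpropagation: update statistics along the path to the root.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    # Plan extraction: follow the child with the highest mean Critic score.
    plan: List[int] = []
    node = root
    while node.children and node.state != TARGET:
        node = max(node.children, key=lambda ch: ch.value / ch.visits)
        plan.append(node.action)
    return plan


if __name__ == "__main__":
    print(search())
```

In this sketch the Critic's score is backed up at every expanded node, so the search receives a signal at each step instead of only at the goal. This is one simple way to read the "dense reward" idea; the actual agents, prompts, and reward design are described in the paper and may differ.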