Large language model (LLM)-based task plans and corresponding human demonstrations for embodied AI may be noisy, containing unnecessary actions, redundant navigation, and logical errors that reduce policy quality. We propose an iterative verification framework in which a Judge LLM critiques action sequences and a Planner LLM applies the revisions, yielding progressively cleaner and more spatially coherent trajectories. Unlike rule-based approaches, our method relies on natural language prompting, enabling broad generalization across error types, including irrelevant actions, contradictions, and missing steps. On a set of manually annotated actions from the TEACh embodied AI dataset, our framework achieves up to 90% recall and 100% precision across four state-of-the-art LLMs (GPT o4-mini, DeepSeek-R1, Gemini 2.5, LLaMA 4 Scout). The refinement loop converges quickly, with 96.5% of sequences requiring at most three iterations, while improving both temporal efficiency and spatial action organization. Crucially, the method preserves human error-recovery patterns rather than collapsing them, supporting future work on robust corrective behavior. By establishing plan verification as a reliable LLM capability for spatial planning and action refinement, we provide a scalable path to higher-quality training data for imitation learning in embodied AI.
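A minimal sketch of the Judge-Planner refinement loop described above is given below. The callables `judge_llm` and `planner_llm`, the prompt wording, and the `NO_ISSUES` stop marker are illustrative assumptions, not the authors' exact prompts or implementation; the iteration cap of three reflects the convergence statistic reported in the abstract.

```python
# Hypothetical sketch of the iterative Judge/Planner verification loop.
# judge_llm and planner_llm stand in for any chat-completion wrapper the
# reader supplies; prompts and the "NO_ISSUES" marker are assumptions.
from typing import Callable, List


def refine_actions(
    actions: List[str],
    judge_llm: Callable[[str], str],
    planner_llm: Callable[[str], str],
    max_iters: int = 3,  # 96.5% of sequences reportedly converge within 3 iterations
) -> List[str]:
    for _ in range(max_iters):
        # Judge LLM critiques the current action sequence.
        critique = judge_llm(
            "Review this action sequence for irrelevant actions, contradictions, "
            "and missing steps. Reply NO_ISSUES if the plan is already clean.\n"
            + "\n".join(actions)
        )
        if "NO_ISSUES" in critique:
            break  # converged: the Judge raises no further objections

        # Planner LLM applies the revisions while keeping genuine
        # error-recovery behavior intact.
        revised = planner_llm(
            "Revise the action sequence to address the critique, preserving any "
            "legitimate error-recovery steps.\nCritique:\n" + critique
            + "\nActions:\n" + "\n".join(actions)
        )
        actions = [line.strip() for line in revised.splitlines() if line.strip()]
    return actions
```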