Beyond Needle(s) in the Embodied Haystack: Environment, Architecture, and Training Considerations for Long Context Reasoning

We introduce $\infty$-THOR, a new framework for long-horizon embodied tasks that advances long-context understanding in embodied AI. $\infty$-THOR provides: (1) a generation framework for synthesizing scalable, reproducible, and unlimited long-horizon trajectories; (2) a novel embodied QA task, Needle(s) in the Embodied Haystack, where multiple scattered clues across extended trajectories test agents' long-context reasoning ability; and (3) a long-horizon dataset and benchmark suite featuring complex tasks that span hundreds of environment steps, each paired with ground-truth action sequences. To support reasoning over such long trajectories, we explore architectural adaptations, including interleaved Goal-State-Action modeling, context extension techniques, and Context Parallelism, to equip LLM-based agents for extreme long-context reasoning and interaction. Experimental results and analyses highlight the challenges posed by our benchmark and provide insights into training strategies and model behaviors under long-horizon conditions. Our work provides a foundation for the next generation of embodied AI systems capable of robust, long-term reasoning and planning.
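
As a rough illustration of what interleaved Goal-State-Action modeling could look like at the data level, the sketch below flattens one trajectory into a single sequence that starts with the goal and then alternates states and actions. This is a minimal sketch under stated assumptions, not the authors' implementation: the `Step` class, the `interleave` function, the string placeholders for observations, and the example action names are all hypothetical, and the actual ordering or tokenization in $\infty$-THOR may differ.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Step:
    state: str   # hypothetical placeholder for an observation (e.g. image features)
    action: str  # hypothetical placeholder for the action taken from that state


def interleave(goal: str, steps: List[Step]) -> List[Tuple[str, str]]:
    """Flatten a trajectory into goal, state, action, state, action, ..."""
    sequence = [("goal", goal)]
    for step in steps:
        sequence.append(("state", step.state))
        sequence.append(("action", step.action))
    return sequence


# Illustrative trajectory; the goal text and action names are assumptions.
trajectory = [
    Step("obs_0", "MoveAhead"),
    Step("obs_1", "RotateRight"),
    Step("obs_2", "PickupObject"),
]
sequence = interleave("put the apple in the fridge", trajectory)
```

Because every step adds a state and an action to the same sequence, a trajectory spanning hundreds of environment steps produces a very long context, which is the regime the context extension techniques and Context Parallelism mentioned above are meant to handle.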
