Recent advances in vision, language, and multimodal learning have substantially accelerated progress in robotic foundation models, with robot manipulation remaining a central and challenging problem. This survey examines robot manipulation from an algorithmic perspective and organizes recent learning-based approaches within a unified abstraction of high-level planning and low-level control. At the high level, we extend the classical notion of task planning to include reasoning over language, code, motion, affordances, and 3D representations, emphasizing their role in structured and long-horizon decision making. At the low level, we propose a training-paradigm-oriented taxonomy for learning-based control, organizing existing methods along three axes: input modeling, latent representation learning, and policy learning. Finally, we identify open challenges and prospective research directions related to scalability, data efficiency, multimodal physical interaction, and safety. Together, these analyses aim to clarify the design space of modern foundation models for robotic manipulation.