Vision-Language-Action (VLA) models are driving a revolution in robotics, enabling machines to understand instructions and interact with the physical world. The field is expanding rapidly with new models and datasets, making it both exciting and challenging to keep pace with. This survey offers a clear and structured guide to the VLA landscape. We design it to follow the natural learning path of a researcher: we start with the basic Modules of any VLA model, trace the history through key Milestones, and then dive deep into the core Challenges that define the recent research frontier. Our main contribution is a detailed breakdown of the five biggest challenges: (1) Representation, (2) Execution, (3) Generalization, (4) Safety, and (5) Dataset and Evaluation. This structure mirrors the developmental roadmap of a generalist agent: establishing the fundamental perception-action loop, scaling capabilities across diverse embodiments and environments, and finally ensuring trustworthy deployment, all supported by the essential data infrastructure. For each challenge, we review existing approaches and highlight future opportunities. We position this paper as both a foundational guide for newcomers and a strategic roadmap for experienced researchers, with the dual aim of accelerating learning and inspiring new ideas in embodied intelligence. A live version of this survey, with continuous updates, is maintained on our \href{https://suyuz1.github.io/VLA-Survey-Anatomy/}{project page}.