The rapid advancement of autonomous systems, including self-driving vehicles and drones, has intensified the need to forge true Spatial Intelligence from multi-modal onboard sensor data. While foundation models excel in single-modal contexts, integrating their capabilities across diverse sensors such as cameras and LiDAR into a unified understanding remains a formidable challenge. This paper presents a comprehensive framework for multi-modal pre-training, identifying the core set of techniques driving progress toward this goal. We dissect the interplay between foundational sensor characteristics and learning strategies, and evaluate the role of platform-specific datasets in enabling these advancements. Our central contribution is a unified taxonomy of pre-training paradigms, ranging from single-modality baselines to sophisticated unified frameworks that learn holistic representations for advanced tasks such as 3D object detection and semantic occupancy prediction. Furthermore, we investigate the integration of textual inputs and occupancy representations to facilitate open-world perception and planning. Finally, we identify critical bottlenecks, such as computational efficiency and model scalability, and propose a roadmap toward general-purpose multi-modal foundation models capable of delivering robust Spatial Intelligence in real-world deployment.