End-to-end autonomous driving methods built on vision-language models (VLMs) have developed rapidly, driven by the universal visual understanding and strong reasoning capabilities these models acquire from large-scale pretraining. However, we find that current VLMs struggle to understand fine-grained 3D spatial relationships, a fundamental requirement for systems that interact with the physical world. To address this issue, we propose SpaceDrive, a spatial-aware VLM-based driving framework that treats spatial information as explicit positional encodings (PEs) rather than textual digit tokens, enabling joint reasoning over semantic and spatial representations. SpaceDrive applies a universal positional encoder to all 3D coordinates derived from multi-view depth estimation, historical ego-states, and text prompts. These 3D PEs are first superimposed on the corresponding 2D visual tokens to augment them. Meanwhile, they serve as a task-agnostic coordinate representation that replaces digit-wise numerical tokens as both inputs and outputs of the VLM. This mechanism enables the model to better index specific visual semantics during spatial reasoning and to regress trajectory coordinates directly rather than generating them digit by digit, thereby improving planning accuracy. Extensive experiments validate that SpaceDrive achieves state-of-the-art open-loop performance on the nuScenes dataset and the second-best Driving Score of 78.02 on the Bench2Drive closed-loop benchmark among existing VLM-based methods.
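To make the described mechanism concrete, the following is a minimal PyTorch sketch of the core idea, not the authors' implementation: the per-axis sinusoidal encoding, the module name `SpatialTokenizer`, all dimensions, and the linear waypoint head are illustrative assumptions.

```python
# Minimal sketch of "spatial information as explicit positional encodings".
# Assumptions (not from the paper): sinusoidal per-axis encoding, d_model=768,
# pe_dim=384, and a single linear head that regresses waypoints directly.
import math
import torch
import torch.nn as nn


def sinusoidal_pe(coords: torch.Tensor, dim: int) -> torch.Tensor:
    """Encode 3D coordinates of shape (..., 3) into a (..., dim) embedding by
    applying a standard sinusoidal encoding to each axis and concatenating."""
    assert dim % 6 == 0, "dim must split evenly across 3 axes x (sin, cos)"
    d_axis = dim // 3
    freqs = torch.exp(
        torch.arange(0, d_axis, 2, dtype=coords.dtype, device=coords.device)
        * (-math.log(10000.0) / d_axis)
    )
    angles = coords.unsqueeze(-1) * freqs                 # (..., 3, d_axis/2)
    pe = torch.cat([angles.sin(), angles.cos()], dim=-1)  # (..., 3, d_axis)
    return pe.flatten(-2)                                 # (..., dim)


class SpatialTokenizer(nn.Module):
    """Hypothetical universal positional encoder: maps any 3D point (from
    depth-lifted pixels, historical ego-states, or prompt coordinates) to a
    token-sized embedding that can be (a) superimposed on 2D visual tokens or
    (b) fed to the VLM as a coordinate token in place of digit tokens."""

    def __init__(self, d_model: int = 768, pe_dim: int = 384):
        super().__init__()
        self.pe_dim = pe_dim
        self.proj = nn.Linear(pe_dim, d_model)  # project PE into token space

    def forward(self, coords: torch.Tensor) -> torch.Tensor:
        return self.proj(sinusoidal_pe(coords, self.pe_dim))


# Usage: augment visual tokens with their depth-derived 3D positions, then
# regress a waypoint from a hidden state instead of emitting digit tokens.
tok = SpatialTokenizer()
vis_tokens = torch.randn(1, 196, 768)      # 2D visual tokens from the encoder
points_3d = torch.randn(1, 196, 3)         # per-token 3D points via depth
vis_tokens = vis_tokens + tok(points_3d)   # superimpose 3D PEs

reg_head = nn.Linear(768, 3)               # hypothetical waypoint regressor
waypoint = reg_head(torch.randn(1, 768))   # (x, y, z) regressed in one step
```

Under these assumptions, one encoder serves every coordinate source, so the regression head can read trajectory waypoints directly from hidden states, which is what allows the model to avoid accumulating digit-by-digit decoding errors.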