We present OpenDriveVLA, a Vision-Language-Action model designed for end-to-end autonomous driving, built upon open-source large language models. OpenDriveVLA generates spatially grounded driving actions by leveraging multimodal inputs, including 2D and 3D instance-aware visual representations, ego vehicle states, and language commands. To bridge the modality gap between driving visual representations and language embeddings, we introduce a hierarchical vision-language alignment process that projects both 2D and 3D structured visual tokens into a unified semantic space. Furthermore, we incorporate structured agent-environment-ego interaction modeling into the autoregressive decoding process, enabling the model to capture fine-grained spatial dependencies and behavior-aware dynamics critical for reliable trajectory planning. Extensive experiments on the nuScenes dataset demonstrate that OpenDriveVLA achieves state-of-the-art results on open-loop trajectory planning and driving-related question answering tasks. Qualitative analyses further illustrate its capability to follow high-level driving commands and generate trajectories in challenging scenarios, highlighting its potential for next-generation end-to-end autonomous driving.
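To make the hierarchical vision-language alignment concrete, the following is a minimal sketch of how 2D and 3D instance-aware visual tokens and the ego state might be projected into a shared LLM embedding space and concatenated with language-command embeddings as the decoding prefix. This is not the authors' implementation; all module names, dimensions, and the two-layer MLP projector design are illustrative assumptions.

```python
# Illustrative sketch (assumed, not the paper's code): projecting 2D/3D visual
# tokens and ego state into the LLM token space for autoregressive decoding.
import torch
import torch.nn as nn


class VisualTokenProjector(nn.Module):
    """Maps structured visual tokens of one modality into the LLM token space."""

    def __init__(self, in_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(in_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:  # (B, N, in_dim)
        return self.proj(tokens)                               # (B, N, llm_dim)


class HierarchicalAlignment(nn.Module):
    """Projects 2D and 3D visual tokens plus ego state into one semantic space."""

    def __init__(self, dim_2d: int, dim_3d: int, dim_ego: int, llm_dim: int):
        super().__init__()
        self.proj_2d = VisualTokenProjector(dim_2d, llm_dim)
        self.proj_3d = VisualTokenProjector(dim_3d, llm_dim)
        self.proj_ego = nn.Linear(dim_ego, llm_dim)

    def forward(self, tok_2d, tok_3d, ego_state, text_embeds):
        # Concatenate aligned visual, ego, and language tokens as the LLM prefix.
        ego_tok = self.proj_ego(ego_state).unsqueeze(1)        # (B, 1, llm_dim)
        return torch.cat(
            [self.proj_2d(tok_2d), self.proj_3d(tok_3d), ego_tok, text_embeds],
            dim=1,
        )


# Example with dummy tensors (batch of 2, hypothetical dimensions).
align = HierarchicalAlignment(dim_2d=256, dim_3d=256, dim_ego=16, llm_dim=4096)
prefix = align(
    torch.randn(2, 32, 256),   # 2D instance tokens
    torch.randn(2, 32, 256),   # 3D instance tokens
    torch.randn(2, 16),        # ego vehicle state
    torch.randn(2, 20, 4096),  # embedded language command tokens
)
print(prefix.shape)  # torch.Size([2, 85, 4096])
```

Under these assumptions, the resulting prefix sequence would condition the language model's autoregressive decoding of trajectory tokens or answers.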