Autonomous driving requires generating safe and reliable trajectories from complex multimodal inputs. Traditional modular pipelines treat perception, prediction, and planning as separate stages, while recent end-to-end (E2E) systems learn them jointly. Vision-language models (VLMs) further enrich this paradigm with cross-modal priors and commonsense reasoning, yet current VLM-based planners face three key challenges: (i) a mismatch between discrete text reasoning and continuous control, (ii) high latency from autoregressive chain-of-thought decoding, and (iii) inefficient or non-causal planners that limit real-time deployment. We propose ColaVLA, a unified vision-language-action framework that shifts reasoning from discrete text into a shared latent space and couples it with a hierarchical, parallel trajectory decoder. The Cognitive Latent Reasoner compresses scene understanding into compact, decision-oriented meta-action embeddings through ego-adaptive selection, requiring only two VLM forward passes. The Hierarchical Parallel Planner then generates multi-scale, causality-consistent trajectories in a single forward pass. Together, these components preserve the generalization and interpretability of VLMs while enabling efficient, accurate, and safe trajectory generation. Experiments on the nuScenes benchmark show that ColaVLA achieves state-of-the-art performance in both open-loop and closed-loop settings with favorable efficiency and robustness.
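To make the two-stage design above concrete, the sketch below shows one way a latent reasoner and a hierarchical parallel planner could be wired together. It is a minimal PyTorch sketch under assumed interfaces: the ToyVLM stand-in backbone, all module names, signatures, tensor shapes, and the ego-state encoding are illustrative assumptions, not the actual ColaVLA implementation.

```python
import torch
import torch.nn as nn

# Minimal, self-contained sketch of the two-stage pipeline described above.
# Every module name, signature, and tensor shape is an assumption for illustration.

DIM = 256  # hypothetical latent width


class ToyVLM(nn.Module):
    """Stand-in for the vision-language backbone (interface assumed)."""
    def __init__(self, dim: int = DIM):
        super().__init__()
        self.encoder = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        return self.encoder(tokens)  # (B, N, dim) -> (B, N, dim)


class CognitiveLatentReasoner(nn.Module):
    """Compresses scene tokens into decision-oriented meta-action embeddings
    using two backbone passes and an ego-adaptive selection gate (all assumed)."""
    def __init__(self, vlm: nn.Module, num_meta_actions: int = 8, dim: int = DIM):
        super().__init__()
        self.vlm = vlm
        self.meta_queries = nn.Parameter(torch.randn(1, num_meta_actions, dim))
        self.ego_proj = nn.Linear(4, dim)   # toy ego state (x, y, v, yaw) -> latent
        self.gate = nn.Linear(dim, 1)       # ego-adaptive selection weight

    def forward(self, scene_tokens: torch.Tensor, ego_state: torch.Tensor):
        b = scene_tokens.size(0)
        ctx = self.vlm(scene_tokens)                           # pass 1: scene encoding
        queries = self.meta_queries.expand(b, -1, -1)
        latents = self.vlm(torch.cat([ctx, queries], dim=1))   # pass 2: latent reasoning
        latents = latents[:, -queries.size(1):]                # keep meta-action slots
        ego = self.ego_proj(ego_state).unsqueeze(1)            # (B, 1, dim)
        weights = torch.sigmoid(self.gate(latents + ego))      # ego-adaptive selection
        return weights * latents                               # (B, M, dim) meta-action embeddings


class HierarchicalParallelPlanner(nn.Module):
    """Decodes (x, y) waypoints at several temporal scales in one forward pass."""
    def __init__(self, dim: int = DIM, horizons=(2, 4, 6)):
        super().__init__()
        # one head per horizon length; each predicts 2D waypoints for its scale
        self.heads = nn.ModuleList(nn.Linear(dim, h * 2) for h in horizons)
        self.horizons = horizons

    def forward(self, meta_actions: torch.Tensor):
        ctx = meta_actions.mean(dim=1)                         # pool meta-action embeddings
        return [head(ctx).view(-1, h, 2)
                for head, h in zip(self.heads, self.horizons)]


if __name__ == "__main__":
    vlm = ToyVLM()
    reasoner = CognitiveLatentReasoner(vlm)
    planner = HierarchicalParallelPlanner()
    scene = torch.randn(1, 32, DIM)   # fused camera tokens (toy)
    ego = torch.randn(1, 4)           # ego state (toy)
    trajs = planner(reasoner(scene, ego))
    print([t.shape for t in trajs])   # one waypoint tensor per temporal scale
```

The sketch only illustrates the data flow implied by the abstract (two backbone passes, latent meta-action selection, single-pass multi-scale decoding); it does not implement the causality constraints or training objectives of the actual system.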