We present a two-phase vision-language QA system for autonomous driving that answers high-level perception, prediction, and planning questions. In Phase-1, a large multimodal language model (Qwen2.5-VL-32B) is conditioned on six-camera inputs, a short temporal window of history frames, and a chain-of-thought prompt with few-shot exemplars. A self-consistency ensemble (majority voting over multiple sampled reasoning chains) further improves answer reliability. In Phase-2, we augment the prompt with nuScenes scene metadata (object annotations, ego-vehicle state, etc.) and category-specific question instructions (separate prompts for perception, prediction, and planning tasks). In experiments on a driving QA benchmark, our approach significantly outperforms the baseline Qwen2.5 models. For example, using 5 history frames and 10-shot prompting in Phase-1 yields 65.1% overall accuracy (vs. 62.61% with zero-shot prompting); applying self-consistency raises this to 66.85%. Phase-2 achieves 67.37% overall. Notably, the system maintains 96% accuracy under severe visual corruption. These results demonstrate that carefully engineered prompts and contextual grounding can greatly enhance high-level driving QA with pretrained vision-language models.
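To make the self-consistency ensemble concrete, the following is a minimal sketch of majority voting over sampled reasoning chains. It assumes a caller-supplied `sample_answer` callable that runs one chain-of-thought pass of the vision-language model (e.g. Qwen2.5-VL-32B at nonzero temperature) and returns a final answer string; that callable, its signature, and the demo sampler are illustrative assumptions, not the authors' actual interface.

```python
# Self-consistency by majority vote: sample several reasoning chains for the
# same question and return the most frequent final answer.
from collections import Counter
from typing import Callable


def self_consistent_answer(sample_answer: Callable[[], str], n_samples: int = 5) -> str:
    """Sample n_samples chains and return the majority-vote answer.

    Ties resolve to the answer that was sampled first.
    """
    answers = [sample_answer() for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]


if __name__ == "__main__":
    # Stand-in sampler for demonstration only; a real sampler would prompt the
    # VLM with the six-camera frames, the history window, and the few-shot
    # chain-of-thought exemplars, then parse out the final answer choice.
    import random

    demo_sampler = lambda: random.choice(["A", "A", "A", "B"])
    print(self_consistent_answer(demo_sampler, n_samples=5))
```

Sampling several chains and voting trades extra inference cost for robustness to any single noisy reasoning path, which is consistent with the accuracy gain reported above (65.1% to 66.85%).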