Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities across a wide range of vision-language tasks. However, their performance as embodied agents, which requires multi-round dialogue, spatial reasoning, and sequential action prediction, remains underexplored. We investigate this potential in the context of Vision-and-Language Navigation (VLN) by introducing a unified and extensible evaluation framework that probes MLLMs as zero-shot agents, bridging traditional navigation datasets into a standardized benchmark named VLN-MME. Its highly modular and accessible design simplifies evaluation and streamlines experiments, enabling structured comparisons and component-level ablations across diverse MLLM architectures, agent designs, and navigation tasks. Crucially, within our framework we observe that augmenting our baseline agent with Chain-of-Thought (CoT) reasoning and self-reflection leads to an unexpected performance decrease. This suggests that MLLMs exhibit poor context awareness in embodied navigation tasks: although they can follow instructions and structure their output, their 3D spatial reasoning fidelity is low. VLN-MME lays the groundwork for systematic evaluation of general-purpose MLLMs in embodied navigation settings and reveals limitations in their sequential decision-making capabilities. We believe these findings offer crucial guidance for post-training MLLMs as embodied agents.