Embodied4C: Measuring What Matters for Embodied Vision-Language Navigation

Vision-language navigation requires agents to reason and act under constraints of embodiment. While vision-language models (VLMs) demonstrate strong generalization, current benchmarks provide limited understanding of how embodiment (i.e., the choice of physical platform, sensor configuration, and modality alignment) influences perception, reasoning, and control. We introduce Embodied4C, a closed-loop benchmark designed as a Turing test for embodied reasoning. The benchmark evaluates the core embodied capabilities of VLMs across three heterogeneous embodiments (autonomous vehicles, aerial drones, and robotic manipulators) through approximately 1.1K one-shot reasoning questions and 58 goal-directed navigation tasks. These tasks jointly assess four foundational dimensions: semantic, spatial, temporal, and physical reasoning. Each embodiment presents dynamic sensor configurations and environment variations to probe generalization beyond platform-specific adaptation. To prevent embodiment overfitting, Embodied4C integrates domain-far queries targeting abstract and cross-context reasoning. Comprehensive evaluation across ten state-of-the-art VLMs and four embodied control baselines shows that cross-modal alignment and instruction tuning matter more than scale, while spatial and temporal reasoning remain the primary bottlenecks for reliable embodied competence.
