Leveraging diverse robotic data for pretraining remains a critical challenge. Existing methods typically model a dataset's action distribution conditioned only on simple observations. However, these inputs are often incomplete, yielding a dispersed conditional action distribution, an issue we refer to as coordinate system chaos and state chaos. This inconsistency significantly hampers pretraining efficiency. To address it, we propose 4D-VLA, a novel approach that integrates 4D information into the input to mitigate these sources of chaos. Our model injects depth and temporal information into visual features via sequential RGB-D inputs, aligning the coordinate systems of the robot and the scene. This alignment endows the model with strong spatiotemporal reasoning capability while keeping training overhead low. Additionally, we introduce memory bank sampling, a frame sampling strategy that extracts informative frames from historical images, further improving effectiveness and efficiency. Experimental results demonstrate that our pretraining method and architectural components substantially enhance model performance. In both simulated and real-world experiments, our model achieves a significant increase in success rate over OpenVLA. To further assess spatial perception and generalization to novel views, we introduce MV-Bench, a multi-view simulation benchmark. Our model consistently outperforms existing methods, demonstrating stronger spatial understanding and adaptability.
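The abstract does not spell out how sequential RGB-D inputs align the robot and scene coordinate systems; a minimal sketch of the standard mechanism such alignment relies on is back-projecting each depth map into the robot base frame via known camera intrinsics and extrinsics, so that features from any viewpoint live in one shared frame. All names below (`backproject_to_base`, `K`, `T_base_cam`) are illustrative assumptions, not the paper's API.

```python
import numpy as np

def backproject_to_base(depth, K, T_base_cam):
    """Lift a depth map into 3D points expressed in the robot base frame.

    depth:      (H, W) depth in meters
    K:          (3, 3) camera intrinsic matrix
    T_base_cam: (4, 4) camera-to-base extrinsic transform
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))  # pixel grid, each (H, W)
    # Pixel coordinates -> camera-frame 3D points, scaled by depth.
    x = (u - K[0, 2]) * depth / K[0, 0]
    y = (v - K[1, 2]) * depth / K[1, 1]
    pts_cam = np.stack([x, y, depth, np.ones_like(depth)], axis=-1)  # (H, W, 4)
    # Camera frame -> robot base frame (homogeneous transform).
    pts_base = pts_cam.reshape(-1, 4) @ T_base_cam.T
    return pts_base[:, :3].reshape(H, W, 3)
```

Under this view, visual tokens from different cameras or timesteps can be tagged with base-frame positions, removing the coordinate system chaos the abstract describes.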
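The abstract likewise leaves the memory bank sampling rule unspecified. As one plausible reading, "extracting informative frames from historical images" can be sketched as farthest-point-style selection over per-frame features, keeping the current frame and greedily adding the history frame most dissimilar to everything already kept. The function and signature here (`sample_memory_bank`) are hypothetical.

```python
import numpy as np

def sample_memory_bank(feats, k):
    """Greedily pick k informative frames from a history of frame features.

    feats: (T, D) per-frame feature vectors (e.g., pooled visual embeddings)
    k:     number of frames to keep
    Returns sorted indices of selected frames; the most recent is always kept.
    """
    T = feats.shape[0]
    selected = [T - 1]  # always keep the current frame
    while len(selected) < min(k, T):
        # Distance from every frame to its nearest already-selected frame.
        d = np.linalg.norm(feats[:, None, :] - feats[selected][None, :, :], axis=-1)
        nearest = d.min(axis=1)
        nearest[selected] = -np.inf  # never re-pick a frame
        selected.append(int(nearest.argmax()))  # add the most dissimilar frame
    return sorted(selected)
```

Compared with uniform striding over the history, such diversity-driven selection avoids spending the frame budget on near-duplicate observations, which is consistent with the efficiency gains the abstract attributes to the strategy.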