Vision-Language-Action (VLA) models have achieved remarkable breakthroughs in robotics, with action chunking playing a dominant role in these advances. Given the real-time and continuous nature of robotic motion control, the strategy for fusing a queue of successive action chunks has a profound impact on the overall performance of VLA models. Existing methods suffer from jitter, stalling, or even pauses in robotic action execution, which not only limits the achievable execution speed but also reduces the overall task success rate. This paper introduces VLA-RAIL (Real-Time Asynchronous Inference Linker), a novel framework that addresses these issues by running model inference and robot motion control asynchronously while guaranteeing smooth, continuous, and high-speed action execution. The core contributions of this paper are twofold: a Trajectory Smoother that effectively filters out noise and jitter in the trajectory of a single action chunk using polynomial fitting, and a Chunk Fuser that seamlessly aligns the currently executing trajectory with the newly arrived chunk, ensuring position, velocity, and acceleration continuity between two successive action chunks. We validate the effectiveness of VLA-RAIL on a benchmark of dynamic simulation tasks and on several real-world manipulation tasks. Experimental results demonstrate that VLA-RAIL significantly reduces motion jitter, increases execution speed, and improves task success rates, positioning it as key infrastructure for the large-scale deployment of VLA models.
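To make the two components concrete, the following is a minimal illustrative sketch, not the authors' implementation: a hypothetical `smooth_chunk` fits a low-order polynomial per joint to suppress high-frequency jitter within one chunk, and a hypothetical `fuse_chunks` splices a new chunk onto the executing trajectory via a quintic bridge that matches position, velocity, and acceleration at both ends. All function names, parameters, and the quintic-blend choice are assumptions for illustration.

```python
import numpy as np

def smooth_chunk(chunk, dt, degree=5):
    """Trajectory Smoother sketch: least-squares polynomial fit per joint.

    chunk: (T, J) array of joint positions from one VLA action chunk.
    Returns a smoothed (T, J) array; the low-order fit suppresses
    high-frequency noise while preserving the underlying motion.
    """
    T, J = chunk.shape
    t = np.arange(T) * dt
    smoothed = np.empty_like(chunk)
    for j in range(J):
        coeffs = np.polyfit(t, chunk[:, j], deg=degree)
        smoothed[:, j] = np.polyval(coeffs, t)
    return smoothed

def quintic_blend(p0, v0, a0, p1, v1, a1, T, n):
    """Quintic polynomial matching position, velocity, and acceleration
    at both endpoints (C^2 continuity). Inputs are (J,) joint vectors."""
    # Boundary conditions for p(t) = sum_k c_k t^k, k = 0..5.
    A = np.array([
        [1, 0, 0,    0,      0,       0],        # p(0)  = p0
        [0, 1, 0,    0,      0,       0],        # p'(0) = v0
        [0, 0, 2,    0,      0,       0],        # p''(0)= a0
        [1, T, T**2, T**3,   T**4,    T**5],     # p(T)  = p1
        [0, 1, 2*T,  3*T**2, 4*T**3,  5*T**4],   # p'(T) = v1
        [0, 0, 2,    6*T,    12*T**2, 20*T**3],  # p''(T)= a1
    ])
    b = np.stack([p0, v0, a0, p1, v1, a1])   # (6, J)
    c = np.linalg.solve(A, b)                # (6, J) coefficients per joint
    t = np.linspace(0.0, T, n)[:, None]      # (n, 1) sample times
    powers = t ** np.arange(6)[None, :]      # (n, 6) monomial basis
    return powers @ c                        # (n, J) blended positions

def fuse_chunks(executing, new_chunk, dt, blend_steps=10):
    """Chunk Fuser sketch: bridge the executing trajectory into the newly
    arrived chunk with a C^2-continuous quintic segment."""
    # Finite-difference boundary state at the tail of the executing trajectory.
    p0 = executing[-1]
    v0 = (executing[-1] - executing[-2]) / dt
    a0 = (executing[-1] - 2 * executing[-2] + executing[-3]) / dt**2
    # Boundary state at the head of the new chunk.
    p1 = new_chunk[0]
    v1 = (new_chunk[1] - new_chunk[0]) / dt
    a1 = (new_chunk[2] - 2 * new_chunk[1] + new_chunk[0]) / dt**2
    bridge = quintic_blend(p0, v0, a0, p1, v1, a1, blend_steps * dt, blend_steps)
    # Drop the duplicated endpoints when concatenating.
    return np.concatenate([executing, bridge[1:-1], new_chunk], axis=0)
```

Under these assumptions, the quintic bridge is what enforces the position, velocity, and acceleration continuity the abstract claims between successive chunks; a lower-order blend would leave acceleration discontinuous at the splice point.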