Vision-Language-Action (VLA) models have emerged as a unified paradigm for robotic perception and control, enabling emergent generalization and long-horizon task execution. However, their deployment in dynamic, real-world environments is severely hindered by high inference latency. While smooth robotic interaction requires control frequencies of 20-30 Hz, current VLA models typically operate at only 3-5 Hz on edge devices due to the memory-bound nature of autoregressive decoding. Existing optimizations often require extensive retraining or compromise model accuracy. To bridge this gap, we introduce ActionFlow, a system-level inference framework tailored for resource-constrained edge platforms. At the core of ActionFlow is a Cross-Request Pipelining strategy, a novel scheduler that redefines VLA inference as a macro-pipeline of micro-requests. The strategy intelligently batches memory-bound Decode phases with compute-bound Prefill phases across consecutive time steps to maximize hardware utilization. Furthermore, to support this scheduling, we propose a Cross-Request State Packed Forward operator and a Unified KV Ring Buffer, which fuse fragmented memory operations into efficient dense computations. Experimental results demonstrate that ActionFlow achieves a 2.55x improvement in FPS on the OpenVLA-7B model without retraining, enabling real-time dynamic manipulation on edge hardware. Our work is available at https://anonymous.4open.science/r/ActionFlow-1D47.
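To make the scheduling idea concrete, the following is a minimal, illustrative Python sketch of cross-request pipelining as described above: each engine iteration co-batches the compute-bound prefill of the newest control-step micro-request with the memory-bound decode steps of all in-flight ones, so the accelerator runs one dense batch per iteration instead of idling between phases. All names here (`MicroRequest`, `CrossRequestScheduler`, `run_fused_forward`) and the action-token count are hypothetical stand-ins, not taken from the ActionFlow codebase.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class MicroRequest:
    """One control time step, viewed as a micro-request with a prefill
    phase (new observation tokens) and several decode phases (action tokens)."""
    step_id: int
    num_action_tokens: int = 7        # hypothetical action-chunk length
    decoded: list = field(default_factory=list)

    @property
    def done(self):
        return len(self.decoded) >= self.num_action_tokens

class CrossRequestScheduler:
    """Toy cross-request pipeline: each iteration fuses the prefill of the
    newest micro-request with the decode steps of all in-flight ones."""

    def __init__(self):
        self.waiting = deque()        # micro-requests awaiting prefill
        self.running = []             # micro-requests in the decode phase

    def submit(self, request):
        self.waiting.append(request)

    def step(self, run_fused_forward):
        prefill = self.waiting.popleft() if self.waiting else None
        # One packed forward pass over a mixed prefill + decode batch.
        new_tokens = run_fused_forward(prefill, self.running)
        for req, tok in zip(self.running, new_tokens):
            req.decoded.append(tok)
        if prefill is not None:
            self.running.append(prefill)   # enters decode next iteration
        self.running = [r for r in self.running if not r.done]

# Toy model stub: emits one dummy action token per decoding request.
def run_fused_forward(prefill_req, decode_reqs):
    return [f"tok{len(r.decoded)}" for r in decode_reqs]

sched = CrossRequestScheduler()
for t in range(3):                    # three consecutive control steps
    sched.submit(MicroRequest(step_id=t))
    sched.step(run_fused_forward)
```

The point of the sketch is the batch composition in `step`: without cross-request pipelining, the prefill of step t+1 would wait for the decode loop of step t to drain, leaving the compute units underutilized during the memory-bound decode phase.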
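Likewise, the sketch below illustrates the idea behind a unified KV ring buffer in isolation: a single preallocated tensor whose slots are claimed and recycled ring-style by successive micro-requests, so KV-cache writes become dense in-place copies rather than fragmented per-request allocations. The class name, shapes, and API are assumptions for illustration only; the paper's actual operator additionally packs these writes into the fused forward pass.

```python
import torch

class UnifiedKVRingBuffer:
    """One preallocated KV tensor shared by all in-flight micro-requests.
    Slots are recycled as old control steps retire, so appends are dense
    in-place writes instead of per-request cache (re)allocations."""

    def __init__(self, num_slots=8, max_len=512, num_heads=32, head_dim=128):
        shape = (num_slots, max_len, num_heads, head_dim)
        self.k = torch.zeros(shape)
        self.v = torch.zeros(shape)
        self.len = [0] * num_slots    # valid prefix length per slot
        self.next_slot = 0
        self.num_slots = num_slots

    def alloc(self):
        """Claim the next slot for a new micro-request (ring reuse)."""
        slot = self.next_slot
        self.next_slot = (self.next_slot + 1) % self.num_slots
        self.len[slot] = 0
        return slot

    def append(self, slot, k_new, v_new):
        """Write [T, H, D] keys/values contiguously after the valid prefix."""
        t = k_new.shape[0]
        start = self.len[slot]
        self.k[slot, start:start + t] = k_new
        self.v[slot, start:start + t] = v_new
        self.len[slot] = start + t

    def view(self, slot):
        """Dense, contiguous KV view for attention over this request."""
        return self.k[slot, :self.len[slot]], self.v[slot, :self.len[slot]]

buf = UnifiedKVRingBuffer(num_slots=4, max_len=64, num_heads=2, head_dim=8)
s = buf.alloc()
buf.append(s, torch.randn(16, 2, 8), torch.randn(16, 2, 8))   # prefill writes
buf.append(s, torch.randn(1, 2, 8), torch.randn(1, 2, 8))     # decode append
k, v = buf.view(s)
assert k.shape == (17, 2, 8)
```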