Vision-Language-Action (VLA) models have emerged as a unified paradigm for robotic perception and control, enabling emergent generalization and long-horizon task execution. However, their deployment in dynamic, real-world environments is severely hindered by high inference latency. While smooth robotic interaction requires control frequencies of 20-30 Hz, current VLA models typically operate at only 3-5 Hz on edge devices due to the memory-bound nature of autoregressive decoding. Existing optimizations often require extensive retraining or compromise model accuracy. To bridge this gap, we introduce ActionFlow, a system-level inference framework tailored for resource-constrained edge platforms. At the core of ActionFlow is a Cross-Request Pipelining strategy, a novel scheduler that redefines VLA inference as a macro-pipeline of micro-requests. The strategy intelligently batches memory-bound Decode phases with compute-bound Prefill phases across consecutive time steps to maximize hardware utilization. Furthermore, to support this scheduling, we propose a Cross-Request State Packed Forward operator and a Unified KV Ring Buffer, which fuse fragmented memory operations into efficient dense computations. Experimental results demonstrate that ActionFlow achieves a 2.55x improvement in FPS on the OpenVLA-7B model without retraining, enabling real-time dynamic manipulation on edge hardware. Our work is available at https://anonymous.4open.science/r/ActionFlow-1D47.
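To make the scheduling idea concrete, the following is a minimal, illustrative Python sketch of cross-request pipelining as described above: each engine iteration co-batches the compute-bound prefill of the newest control-step micro-request with the memory-bound decode steps of all in-flight ones, so the accelerator runs one dense batch per iteration instead of idling between phases. All names here (`MicroRequest`, `CrossRequestScheduler`, `run_fused_forward`) and the action-token count are hypothetical stand-ins, not taken from the ActionFlow codebase.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class MicroRequest:
    """One control time step, viewed as a micro-request with a prefill
    phase (new observation tokens) and several decode phases (action tokens)."""
    step_id: int
    num_action_tokens: int = 7        # hypothetical action-chunk length
    decoded: list = field(default_factory=list)

    @property
    def done(self):
        return len(self.decoded) >= self.num_action_tokens

class CrossRequestScheduler:
    """Toy cross-request pipeline: each iteration fuses the prefill of the
    newest micro-request with the decode steps of all in-flight ones."""

    def __init__(self):
        self.waiting = deque()        # micro-requests awaiting prefill
        self.running = []             # micro-requests in the decode phase

    def submit(self, request):
        self.waiting.append(request)

    def step(self, run_fused_forward):
        prefill = self.waiting.popleft() if self.waiting else None
        # One packed forward pass over a mixed prefill + decode batch.
        new_tokens = run_fused_forward(prefill, self.running)
        for req, tok in zip(self.running, new_tokens):
            req.decoded.append(tok)
        if prefill is not None:
            self.running.append(prefill)   # enters decode next iteration
        self.running = [r for r in self.running if not r.done]

# Toy model stub: emits one dummy action token per decoding request.
def run_fused_forward(prefill_req, decode_reqs):
    return [f"tok{len(r.decoded)}" for r in decode_reqs]

sched = CrossRequestScheduler()
for t in range(3):                    # three consecutive control steps
    sched.submit(MicroRequest(step_id=t))
    sched.step(run_fused_forward)
```

The point of the sketch is the batch composition in `step`: without cross-request pipelining, the prefill of step t+1 would wait for the decode loop of step t to drain, leaving the compute units underutilized during the memory-bound decode phase.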
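Likewise, the sketch below illustrates the idea behind a unified KV ring buffer in isolation: a single preallocated tensor whose slots are claimed and recycled ring-style by successive micro-requests, so KV-cache writes become dense in-place copies rather than fragmented per-request allocations. The class name, shapes, and API are assumptions for illustration only; the paper's actual operator additionally packs these writes into the fused forward pass.

```python
import torch

class UnifiedKVRingBuffer:
    """One preallocated KV tensor shared by all in-flight micro-requests.
    Slots are recycled as old control steps retire, so appends are dense
    in-place writes instead of per-request cache (re)allocations."""

    def __init__(self, num_slots=8, max_len=512, num_heads=32, head_dim=128):
        shape = (num_slots, max_len, num_heads, head_dim)
        self.k = torch.zeros(shape)
        self.v = torch.zeros(shape)
        self.len = [0] * num_slots    # valid prefix length per slot
        self.next_slot = 0
        self.num_slots = num_slots

    def alloc(self):
        """Claim the next slot for a new micro-request (ring reuse)."""
        slot = self.next_slot
        self.next_slot = (self.next_slot + 1) % self.num_slots
        self.len[slot] = 0
        return slot

    def append(self, slot, k_new, v_new):
        """Write [T, H, D] keys/values contiguously after the valid prefix."""
        t = k_new.shape[0]
        start = self.len[slot]
        self.k[slot, start:start + t] = k_new
        self.v[slot, start:start + t] = v_new
        self.len[slot] = start + t

    def view(self, slot):
        """Dense, contiguous KV view for attention over this request."""
        return self.k[slot, :self.len[slot]], self.v[slot, :self.len[slot]]

buf = UnifiedKVRingBuffer(num_slots=4, max_len=64, num_heads=2, head_dim=8)
s = buf.alloc()
buf.append(s, torch.randn(16, 2, 8), torch.randn(16, 2, 8))   # prefill writes
buf.append(s, torch.randn(1, 2, 8), torch.randn(1, 2, 8))     # decode append
k, v = buf.view(s)
assert k.shape == (17, 2, 8)
```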