Atom: Efficient On-Device Video-Language Pipelines Through Modular Reuse

Recent advances in video-language models have enabled powerful applications such as video retrieval, captioning, and assembly. However, executing such multi-stage pipelines efficiently on mobile devices remains challenging because of redundant model loads and fragmented execution. We introduce Atom, an on-device system that restructures video-language pipelines for fast and efficient execution. Atom decomposes a billion-parameter model into reusable modules, such as the visual encoder and language decoder, and reuses them across subtasks like captioning, reasoning, and indexing. This reuse-centric design eliminates repeated model loading and enables parallel execution, reducing end-to-end latency with only a marginal loss in task performance. On commodity smartphones, Atom runs 27–33% faster than non-reuse baselines, with a drop of at most 2.3 Recall@1 in retrieval and at most 1.5 CIDEr in captioning. These results position Atom as a practical, scalable approach for efficient video-language understanding on edge devices.

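The reuse idea described in the abstract can be illustrated with a minimal sketch. It is not Atom's implementation: the `ModuleCache` class, the loader functions, the string-based stand-ins for the visual encoder and language decoder, and the per-subtask-reload baseline are all hypothetical names and assumptions introduced here for illustration. The sketch only shows how loading each module once and sharing it across the captioning, reasoning, and indexing subtasks reduces the number of model loads; it does not model parallel execution or real inference.

```python
from typing import Callable, Dict, Tuple

# Hypothetical subtask names, taken from the abstract's examples.
SUBTASKS = ("captioning", "reasoning", "indexing")

Module = Callable[[str], str]


def load_visual_encoder() -> Module:
    """Stand-in for an expensive load of the visual encoder (assumption)."""
    return lambda clip: f"feat({clip})"


def load_language_decoder() -> Module:
    """Stand-in for an expensive load of the language decoder (assumption)."""
    return lambda prompt: f"text[{prompt}]"


LOADERS: Dict[str, Callable[[], Module]] = {
    "visual_encoder": load_visual_encoder,
    "language_decoder": load_language_decoder,
}


class ModuleCache:
    """Loads each named module at most once and returns the shared instance."""

    def __init__(self, loaders: Dict[str, Callable[[], Module]]) -> None:
        self._loaders = loaders
        self._loaded: Dict[str, Module] = {}
        self.load_count = 0  # number of simulated loads performed so far

    def get(self, name: str) -> Module:
        if name not in self._loaded:
            self._loaded[name] = self._loaders[name]()
            self.load_count += 1
        return self._loaded[name]


def run_with_reuse(cache: ModuleCache, clip: str) -> Dict[str, str]:
    """Encode the clip once, then reuse the same decoder for every subtask."""
    features = cache.get("visual_encoder")(clip)
    decoder = cache.get("language_decoder")
    return {task: decoder(f"{task}:{features}") for task in SUBTASKS}


def run_without_reuse(clip: str) -> Tuple[Dict[str, str], int]:
    """Hypothetical baseline: every subtask reloads both modules."""
    loads = 0
    results: Dict[str, str] = {}
    for task in SUBTASKS:
        encoder = load_visual_encoder()
        decoder = load_language_decoder()
        loads += 2
        results[task] = decoder(f"{task}:{encoder(clip)}")
    return results, loads


if __name__ == "__main__":
    cache = ModuleCache(LOADERS)
    reused = run_with_reuse(cache, "clip0")
    baseline, baseline_loads = run_without_reuse("clip0")
    print("loads with reuse:", cache.load_count)
    print("loads without reuse:", baseline_loads)
    print("same outputs:", reused == baseline)
```

In this toy setting both paths produce identical outputs, and the only difference is how many times a module is loaded. On a real device the saving would depend on model size, storage bandwidth, and memory pressure, none of which this sketch measures.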