Multimodal large language models (MLLMs) extend LLMs with visual understanding through a three-stage pipeline: multimodal preprocessing, vision encoding, and LLM inference. While these stages enhance capability, they introduce significant system bottlenecks. First, multimodal preprocessing, especially video decoding, often dominates Time-to-First-Token (TTFT). Most systems rely on CPU-based decoding, which severely limits throughput, while existing GPU-based approaches prioritize throughput-oriented parallelism and fail to meet the latency-sensitive requirements of MLLM inference. Second, the vision encoder is a standalone, compute-intensive stage that produces visual embeddings and cannot be co-batched with LLM prefill or decoding. This heterogeneity forces inter-stage blocking and increases token-generation latency. Even when deployed on separate GPUs, these stages underutilize available compute and memory resources, constraining overall system throughput. To address these challenges, we present FlashCodec and UnifiedServe, two complementary designs that jointly optimize the end-to-end MLLM pipeline. FlashCodec accelerates the multimodal preprocessing stage through collaborative multi-GPU video decoding, reducing decoding latency while preserving high throughput. UnifiedServe optimizes the vision-encoding and LLM-inference stages by logically decoupling their execution to eliminate inter-stage blocking while physically sharing GPU resources to maximize system utilization. By carefully orchestrating execution across stages and minimizing interference, UnifiedServe together with FlashCodec forms an end-to-end optimized stack that can serve up to 3.0$\times$ more requests or enforce 1.5$\times$ tighter SLOs, while achieving up to 4.4$\times$ higher throughput compared to state-of-the-art systems.
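To make the "logically decoupled, physically shared" idea concrete, the following is a minimal sketch, not the paper's implementation: vision encoding and LLM decoding keep separate work queues (logical decoupling) but are interleaved on one device by a single scheduler (physical sharing), with latency-sensitive decode steps given priority so encoder work never blocks token generation. All names, queue contents, and timings here are illustrative assumptions.

```python
import time
from collections import deque

# Separate logical queues for the two stages (assumed workload shapes).
encode_queue = deque(["video_chunk_0", "video_chunk_1"])  # vision-encoder work
decode_queue = deque(["req_0", "req_1", "req_2"])         # LLM decode iterations

def encode_step(chunk):
    time.sleep(0.005)  # stand-in for one vision-encoder kernel launch

def decode_step(req):
    time.sleep(0.001)  # stand-in for one LLM token-generation step

# One scheduler owns the shared GPU: it drains latency-sensitive decode
# steps first, then slots encoder work into leftover time, so neither
# stage blocks the other (no inter-stage blocking).
while encode_queue or decode_queue:
    while decode_queue:                 # decode has priority: it bounds per-token latency
        decode_step(decode_queue.popleft())
    if encode_queue:                    # backfill idle GPU time with encoder work
        encode_step(encode_queue.popleft())
```

In a real serving system the decode queue is continuously refilled as new tokens are requested, and the encoder chunk size would be tuned so each backfilled slice fits the gap between decode iterations; this sketch only illustrates the scheduling structure.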