Compared to traditional machine learning models, recent large language models (LLMs) can exhibit multi-task-solving capabilities through multiple dialogues and multi-modal data sources. These unique characteristics of LLMs, together with their large model size, make their deployment more challenging. Specifically, (i) deploying LLMs on local devices is constrained by limited computational, memory, and energy resources, while (ii) deploying them in the cloud cannot guarantee real-time service and incurs communication/usage costs. In this paper, we design TMO, a local-cloud LLM inference system with Three-M Offloading: Multi-modal, Multi-task, and Multi-dialogue. TMO incorporates (i) a lightweight local LLM that can process simple tasks at high speed and (ii) a large-scale cloud LLM that can handle multi-modal data sources. We develop a resource-constrained reinforcement learning (RCRL) strategy for TMO that optimizes the inference location (i.e., local vs. cloud) and the multi-modal data sources to use for each task/dialogue, aiming to maximize a long-term reward that combines response quality, latency, and usage cost while adhering to resource constraints. We also contribute M4A1, a newly curated dataset containing reward and cost metrics across multiple modality, task, dialogue, and LLM configurations, enabling evaluation of offloading decisions. We demonstrate the effectiveness of TMO against several exploration-decision and LLM-as-Agent baselines, showing significant improvements in latency, cost, and response quality.
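To make the RCRL objective concrete, one standard way to formalize such resource-constrained sequential decisions is a constrained Markov decision process; the formulation below is a minimal sketch under that assumption, with illustrative symbols rather than the paper's own notation:
\[
\max_{\pi}\; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{T} \gamma^{t} r_t\right]
\quad \text{s.t.} \quad
\mathbb{E}_{\pi}\!\left[\sum_{t=0}^{T} c_t^{(k)}\right] \le B_k, \quad k = 1, \dots, K,
\]
where the action at each task/dialogue step $t$ selects the inference location (local vs. cloud) and the multi-modal data sources to include, $r_t$ is a reward aggregating response quality, latency, and usage cost, $c_t^{(k)}$ denotes the per-step consumption of the $k$-th constrained resource, and $B_k$ is the corresponding budget.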