Mixture-of-Experts (MoE) has become a dominant architecture in large language models (LLMs) due to its ability to scale model capacity via sparse expert activation. Meanwhile, serverless computing, with its elasticity and pay-per-use billing, is well suited to deploying MoE models under bursty workloads. However, the large number of experts in MoE models incurs high inference costs due to memory-intensive parameter caching, and these costs are difficult to mitigate via simple model partitioning because expert activation is input-dependent. To address these issues, we propose Remoe, a heterogeneous MoE inference system tailored for serverless computing. Remoe assigns non-expert modules to GPUs and expert modules to CPUs, and further offloads infrequently activated experts to separate serverless functions to reduce memory overhead and enable parallel execution. Remoe incorporates three key techniques: (1) a Similar Prompts Searching (SPS) algorithm that predicts expert activation patterns from the semantic similarity of inputs; (2) a Main Model Pre-allocation (MMP) algorithm that ensures service-level objectives (SLOs) via worst-case memory estimation; and (3) a joint memory and replica optimization framework leveraging Lagrangian duality and the Longest Processing Time (LPT) algorithm. We implement Remoe on Kubernetes and evaluate it on multiple LLM benchmarks. Experimental results show that Remoe reduces inference cost by up to 57% and cold-start latency by 47% compared with state-of-the-art baselines.
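To give a concrete sense of the scheduling component mentioned above, the following is a minimal sketch of the classical Longest Processing Time (LPT) heuristic, applied here to a hypothetical task of balancing estimated expert load across a fixed number of serverless replicas. The expert names, load estimates, and the `num_replicas` parameter are illustrative assumptions and do not reflect Remoe's actual interface or load model.

```python
import heapq

def lpt_assign(expert_loads, num_replicas):
    """Classical LPT heuristic: place items in descending order of load,
    each onto the currently least-loaded replica.

    expert_loads : dict mapping expert id -> estimated load (illustrative values)
    num_replicas : number of serverless replicas to balance across (assumed parameter)
    """
    # Min-heap of (current_load, replica_index), so the lightest replica is always on top.
    heap = [(0.0, r) for r in range(num_replicas)]
    heapq.heapify(heap)
    assignment = {r: [] for r in range(num_replicas)}

    # Longest Processing Time order: heaviest experts are placed first.
    for expert, load in sorted(expert_loads.items(), key=lambda kv: kv[1], reverse=True):
        current, replica = heapq.heappop(heap)
        assignment[replica].append(expert)
        heapq.heappush(heap, (current + load, replica))
    return assignment

# Toy usage: balance eight experts' estimated loads across three replicas.
if __name__ == "__main__":
    loads = {f"expert_{i}": l for i, l in enumerate([9.0, 7.5, 6.0, 5.0, 4.5, 3.0, 2.0, 1.0])}
    print(lpt_assign(loads, num_replicas=3))
```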