Mixture-of-Experts (MoE) models, the state of the art in large-scale AI, achieve high quality by sparsely activating parameters. However, their reliance on routing among a few monolithic experts via a top-k mechanism creates a "quality cliff", offering only a handful of coarse-grained operating points. This inflexibility forces a difficult trade-off between cost and quality, preventing adaptation to diverse Service Level Objectives (SLOs) and leading to significant resource over-provisioning. This paper introduces MoE-Prism, a model-system co-design that transforms rigid MoE models into elastic services. Our methodology is divided into two phases. First, an \emph{Offline Refactoring Engine} systematically deconstructs monolithic experts into fine-grained "sub-experts." This engine employs a metaheuristic partitioning solver that groups neurons so as to preserve functional locality, without requiring retraining. Second, an \emph{Online Scheduling Engine} exploits this new elasticity through QoS-aware scheduling. It implements specialized policies for complex system problems, including maximizing throughput in cloud deployments and managing latency-optimized offloading for memory-constrained devices. Our evaluation across three different MoE models shows that MoE-Prism provides over 4 times more distinct, stable operating points than the baseline. This allows an AI service to dynamically improve throughput by up to 19.9\% under a strict latency budget, or to reduce latency by up to 10.36\% under limited resources. MoE-Prism provides the critical "control knob" to bridge the model-system gap, enabling the next generation of adaptive, efficient, and QoS-aware AI services.
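For intuition, the following minimal NumPy sketch illustrates the core idea behind the offline refactoring: a single ReLU FFN expert is sliced along its intermediate dimension into sub-experts, and quality/cost is traded by choosing how many to activate. The function names, the contiguous slicing, and the ReLU FFN are illustrative assumptions; the paper's engine instead selects the neuron grouping with a metaheuristic solver to preserve functional locality.

    import numpy as np

    def split_expert(W_in, W_out, num_sub):
        """Partition an FFN expert's intermediate neurons into num_sub sub-experts.

        Contiguous slicing is a placeholder for the paper's metaheuristic
        grouping solver; only the shapes and the summation structure matter here.
        """
        groups = np.array_split(np.arange(W_in.shape[0]), num_sub)
        return [(W_in[g], W_out[:, g]) for g in groups]

    def run_sub_experts(x, sub_experts, active):
        """Evaluate only the first `active` sub-experts; cost scales with `active`.

        Because ReLU is elementwise, activating all sub-experts reproduces the
        monolithic expert exactly; fewer sub-experts give a cheaper approximation,
        yielding many fine-grained operating points instead of one.
        """
        y = np.zeros(sub_experts[0][1].shape[0])
        for W_in_g, W_out_g in sub_experts[:active]:
            y += W_out_g @ np.maximum(W_in_g @ x, 0.0)
        return y

    # Demo: a toy expert with d_model=8, d_ff=32, split into 4 sub-experts.
    rng = np.random.default_rng(0)
    W_in, W_out = rng.standard_normal((32, 8)), rng.standard_normal((8, 32))
    subs = split_expert(W_in, W_out, num_sub=4)
    x = rng.standard_normal(8)
    full = run_sub_experts(x, subs, active=4)   # identical to the original expert
    cheap = run_sub_experts(x, subs, active=2)  # ~half the FLOPs, approximate output

Running all sub-experts reproduces the monolithic expert exactly (ReLU is elementwise), which is why the refactoring requires no retraining; an online scheduler can then choose `active` per request, e.g. lowering it under a tight latency budget.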