Mixture-of-Experts (MoE) models scale large language models through conditional computation, but inference becomes memory-bound once expert weights exceed GPU memory capacity. In this case, weights must be offloaded to external memory, and fetching them incurs costly, repeated transfers. We address this by adopting CXL-attached near-data processing (CXL-NDP) as the offloading tier to execute cold experts in place, converting expensive parameter movement into cheaper activation movement. Unlike prior GPU-NDP systems, which are largely context-agnostic and reactive, we develop a context-aware MoE system that uses prefill-stage activation statistics to guide decoding-stage expert placement, dynamically pins hot experts in GPU-side HBM, and maps the remainder to CXL-NDP. To fit within NDP's limited compute throughput, we introduce context-aware mixed-precision quantization that allocates per-expert bitwidths (1-4 bits) based on prefill-stage statistics. The resulting MoE inference system overlaps GPU and NDP execution while minimizing cross-device data movement. Evaluation on the GPU-NDP system shows that our approach achieves up to an 8.7-fold improvement in decoding throughput over the state-of-the-art method, while incurring only a 0.13% average accuracy drop.
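To make the context-aware placement and bitwidth policy concrete, the following is a minimal Python sketch of one plausible realization: experts are ranked by how often they were activated during prefill, the hottest ones are pinned in HBM at full precision, and colder experts mapped to CXL-NDP receive 1-4 bit quantization by activation share. The HBM-slot budget, frequency thresholds, and function names are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch (not the paper's implementation): given per-expert
# activation counts collected during the prefill pass, pin the most frequently
# activated ("hot") experts in GPU HBM and assign lower bitwidths to colder
# experts mapped to CXL-NDP. All names, budgets, and thresholds are assumed.
from dataclasses import dataclass

@dataclass
class ExpertPlan:
    expert_id: int
    device: str      # "hbm" or "cxl_ndp"
    bits: int        # 16 for HBM-resident experts, 1-4 for NDP-resident experts

def plan_experts(prefill_counts, hbm_slots=8):
    """Rank experts by prefill activation frequency; keep the top `hbm_slots`
    in HBM at full precision and quantize the rest for NDP execution."""
    total = sum(prefill_counts.values()) or 1
    ranked = sorted(prefill_counts, key=prefill_counts.get, reverse=True)
    plans = []
    for rank, eid in enumerate(ranked):
        share = prefill_counts[eid] / total
        if rank < hbm_slots:
            plans.append(ExpertPlan(eid, "hbm", 16))
        else:
            # Colder experts get fewer bits; thresholds are illustrative only.
            if share > 0.05:
                bits = 4
            elif share > 0.02:
                bits = 3
            elif share > 0.005:
                bits = 2
            else:
                bits = 1
            plans.append(ExpertPlan(eid, "cxl_ndp", bits))
    return plans

if __name__ == "__main__":
    # Toy prefill statistics for 16 experts of one MoE layer.
    counts = {i: max(1, 100 - 7 * i) for i in range(16)}
    for p in plan_experts(counts, hbm_slots=4):
        print(p)
```

In this sketch the plan is recomputed per request from prefill statistics, which mirrors the abstract's idea of using the prefill stage to guide decoding-stage placement; how the real system amortizes repinning and requantization is not specified here.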