Recommendation systems deliver substantial economic benefits by providing personalized predictions. Generative recommendation (GR) integrates LLMs to enhance the understanding of long user-item sequences. Despite employing attention-based architectures, GR's workload differs markedly from that of LLM serving. GR typically processes long prompts while producing short, fixed-length outputs, yet the computational cost of each decode phase is especially high due to the large beam width. In addition, since beam search involves a vast item space, the sorting overhead becomes particularly time-consuming. We propose xGR, a GR-oriented serving system that meets strict low-latency requirements under high-concurrency scenarios. First, xGR unifies the processing of the prefill and decode phases through staged computation and a separated KV cache. Second, xGR enables early sorting termination and mask-based item filtering with data-structure reuse. Third, xGR reconstructs the overall pipeline to exploit multi-level overlap and multi-stream parallelism. Our experiments on real-world recommendation-service datasets demonstrate that xGR achieves at least 3.49x higher throughput than the state-of-the-art baseline under strict latency constraints.
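The sorting optimizations mentioned above can be illustrated with a minimal sketch. This is not xGR's actual implementation; it assumes a flat score vector over the item vocabulary and uses NumPy's `argpartition` to avoid fully sorting the item space (only the k survivors are sorted), with a boolean mask excluding filtered items. The function name and shapes are illustrative assumptions.

```python
import numpy as np

def topk_with_mask(scores: np.ndarray, mask: np.ndarray, k: int) -> np.ndarray:
    """Select top-k item indices without fully sorting the item space.

    scores: (V,) beam-expanded scores over the item vocabulary
    mask:   (V,) boolean, True for items to filter out
    """
    # Mask-based filtering: excluded items get -inf so they can never rank.
    masked = np.where(mask, -np.inf, scores)
    # Partial selection: argpartition runs in O(V), skipping the
    # full O(V log V) sort over the vast item space.
    top = np.argpartition(masked, -k)[-k:]
    # Only the k survivors are fully sorted (descending score order).
    return top[np.argsort(-masked[top])]

# Example: 8 items, two filtered out by the mask
scores = np.array([0.1, 0.9, 0.3, 0.8, 0.2, 0.7, 0.5, 0.4])
mask = np.zeros(8, dtype=bool)
mask[[1, 3]] = True  # items 1 and 3 are excluded
print(topk_with_mask(scores, mask, 3))  # -> [5 6 7]
```

In a real GR beam step this selection runs per request over beam-expanded candidates, so replacing the full sort with a partial one directly shrinks the per-decode sorting overhead the abstract refers to.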