ProactivePIM: Accelerating Weight-Sharing Embedding Layer with PIM for Scalable Recommendation System

Although deep learning-based personalized recommendation systems provide high-quality recommendations, they strain data center resources. The main bottleneck is the embedding layer, which is highly memory-intensive because of its sparse, irregular accesses to embeddings. Recent near-memory processing (NMP) and processing-in-memory (PIM) architectures have addressed this bottleneck by exploiting parallelism within memory. However, as model sizes grow year by year and can exceed the capacity of a single server, inference on single-node servers becomes challenging, which makes model compression necessary. Various algorithms have been proposed to reduce model size, but they come at the cost of increased memory access and CPU-PIM communication. We present ProactivePIM, a PIM system tailored for weight-sharing algorithms, a family of compression methods, such as QR-trick and TT-Rec, that decompose an embedding table into compact subtables. Our analysis shows that executing the embedding layer with weight-sharing algorithms increases memory access and incurs CPU-PIM communication. We also find that these algorithms exhibit a unique data locality characteristic, which we name intra-GnR locality. ProactivePIM accelerates weight-sharing algorithms with a heterogeneous HBM-DIMM memory architecture that integrates a two-level PIM system, consisting of base-die PIM (bd-PIM) and bank-group PIM (bg-PIM), inside the HBM. For further speedup, ProactivePIM prefetches embeddings with high intra-GnR locality into an SRAM cache within bg-PIM and eliminates CPU-PIM communication by duplicating the target subtables across bank groups. With additional optimization techniques, our design effectively accelerates weight-sharing algorithms, achieving speedups of 2.22x on QR-trick and 2.15x on TT-Rec over the baseline architecture.
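To make the weight-sharing idea concrete, the following is a minimal software sketch of a QR-trick lookup, not the ProactivePIM design itself. It assumes the common quotient-remainder formulation, in which an index selects one row from a quotient subtable and one row from a remainder subtable, and the two rows are combined element-wise. The function names (`build_qr_tables`, `qr_lookup`, `gather_and_reduce`), the choice of multiplication as the combining operation, the sum pooling, and the table sizes are all illustrative assumptions.

```python
import math
import random


def build_qr_tables(num_rows, num_collisions, dim, seed=0):
    """Build quotient and remainder subtables that stand in for one
    embedding table of `num_rows` rows (hypothetical helper)."""
    rng = random.Random(seed)
    q_rows = math.ceil(num_rows / num_collisions)  # quotient subtable rows
    r_rows = num_collisions                        # remainder subtable rows
    q_table = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(q_rows)]
    r_table = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(r_rows)]
    return q_table, r_table


def qr_lookup(index, q_table, r_table):
    """Reconstruct one embedding: two subtable reads plus one combine step.
    Element-wise multiplication is assumed as the combining operation."""
    m = len(r_table)
    q_vec = q_table[index // m]  # row chosen by the quotient
    r_vec = r_table[index % m]   # row chosen by the remainder
    return [a * b for a, b in zip(q_vec, r_vec)]


def gather_and_reduce(indices, q_table, r_table):
    """Gather the embeddings for `indices` and sum-pool them into one vector."""
    pooled = [0.0] * len(q_table[0])
    for idx in indices:
        vec = qr_lookup(idx, q_table, r_table)
        pooled = [p + v for p, v in zip(pooled, vec)]
    return pooled


if __name__ == "__main__":
    # 1,000,000 logical rows are represented by 1000 + 1000 stored rows.
    q_table, r_table = build_qr_tables(num_rows=1_000_000, num_collisions=1000, dim=4)
    print(len(q_table) + len(r_table), "stored rows")
    print(gather_and_reduce([1234, 1999, 42], q_table, r_table))
```

In this sketch each logical lookup reads two subtable rows instead of one and needs an extra combine step, which illustrates why compression of this kind trades a smaller footprint for more memory accesses.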
