DL inference queries play an important role in diverse internet services, and a large fraction of datacenter cycles is spent processing them. Specifically, the general matrix-matrix multiplication (GEMM) operations of fully-connected MLP layers dominate many inference tasks. We find that, contrary to common assumptions, the GEMM operations of datacenter DL inference tasks are memory-bandwidth bound, for two reasons: (1) strict query latency constraints force small-batch operation, which limits reuse and increases bandwidth demands; and (2) large and colocated models require reading the large weight matrices from main memory, again requiring high bandwidth without offering reuse opportunities. We demonstrate the large potential of accelerating these small-batch GEMMs with processing in main memory (PIM), that is, in the main memory of the CPU. We develop a novel GEMM execution flow and corresponding memory-side address-generation logic, which we call StepStone, that exploits GEMM locality and enables long-running PIM kernels despite the complex address-mapping functions employed by the CPU, which would otherwise destroy locality. Our evaluation of StepStone variants at the channel, device, and within-device PIM levels, along with optimizations that balance parallelism benefits against data-distribution overheads, demonstrates $12\times$ better minimum latency than a CPU and $2.8\times$ greater throughput under strict query latency constraints. End-to-end performance analysis of recent recommendation and language models shows that StepStone PIM outperforms a fast CPU (by up to $16\times$) and prior main-memory acceleration approaches (by up to $2.4\times$ compared to the best prior approach).
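The claim that small-batch GEMMs are bandwidth bound can be illustrated with a rough roofline-style estimate. The following is a minimal sketch, not the paper's analysis: it assumes each operand is moved to or from memory exactly once, 4-byte elements, and a hypothetical machine whose peak compute and bandwidth figures are placeholders chosen only for illustration. The function names are likewise our own.

```python
def gemm_arithmetic_intensity(batch, in_features, out_features, bytes_per_elem=4):
    """FLOPs per byte for Y[batch, out] = X[batch, in] @ W[in, out].

    Assumption: X, W, and Y each cross the memory interface exactly once
    (no cache reuse of W across queries), which is an idealization.
    """
    flops = 2 * batch * in_features * out_features  # one multiply + one add per term
    bytes_moved = bytes_per_elem * (
        batch * in_features            # activations X
        + in_features * out_features   # weights W (dominant at small batch)
        + batch * out_features         # outputs Y
    )
    return flops / bytes_moved


def is_bandwidth_bound(batch, in_features, out_features, peak_flops, peak_bw,
                       bytes_per_elem=4):
    """True when the GEMM's intensity is below the machine's ridge point."""
    ridge = peak_flops / peak_bw  # FLOPs per byte needed to saturate compute
    intensity = gemm_arithmetic_intensity(
        batch, in_features, out_features, bytes_per_elem
    )
    return intensity < ridge


if __name__ == "__main__":
    # Hypothetical machine: 1 TFLOP/s peak compute, 100 GB/s memory bandwidth.
    PEAK_FLOPS, PEAK_BW = 1e12, 1e11
    for batch in (1, 4, 32, 1024):
        ai = gemm_arithmetic_intensity(batch, 1024, 1024)
        bound = is_bandwidth_bound(batch, 1024, 1024, PEAK_FLOPS, PEAK_BW)
        print(f"batch={batch:5d}  intensity={ai:8.2f} FLOP/B  bandwidth_bound={bound}")
```

Under these assumptions, the weight matrix dominates the traffic at small batch sizes, so the intensity grows roughly in proportion to the batch size and stays below the ridge point until the batch is fairly large.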