Developing kernels for Processing-In-Memory (PIM) platforms poses unique challenges in data management and parallel programming on limited processing units. Although software development kits (SDKs) for PIM, such as the UPMEM SDK, provide essential tools, these emerging platforms still leave significant room for performance optimization. In this paper, we reveal surprising inefficiencies in the UPMEM software stack and explore non-standard programming techniques. By making simple modifications to the assembly generated by the UPMEM compiler, we achieve speedups of 1.6-2x for integer addition and 1.4-5.9x for integer multiplication, depending on the data type. We also demonstrate that bit-serial processing of low-precision data is a viable option for UPMEM: in INT4 bit-serial dot-product computation, UPMEM can achieve a speedup of over 2.7x over the baseline. Minor API extensions for PIM allocation that account for the non-uniform memory access (NUMA) architecture of the server further improve the consistency and throughput of host-PIM data transfers by up to 2.9x. Finally, we show that, when the matrix is preloaded into PIM, our optimized kernels outperform a dual-socket CPU server by over 3x for INT8 generalized matrix-vector multiplication (GEMV) and by 10x for INT4 GEMV. Our optimized INT8 GEMV kernel outperforms the baseline by 3.5x.
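For readers unfamiliar with the idea, the following is a minimal host-side C sketch of a bit-serial INT4 dot product, shown only to illustrate the general technique; it is not the paper's UPMEM kernel, and the unsigned bit-plane layout and function name are illustrative assumptions. Each operand vector is decomposed into bit-planes, and the dot product is reconstructed from AND/popcount operations weighted by powers of two.

```c
#include <stdint.h>
#include <stdio.h>

/* Bit-serial dot product of two vectors of unsigned 4-bit elements.
 * Each vector is stored as 4 bit-planes: plane[i] packs bit i of 64
 * elements into one uint64_t. The result is reconstructed as
 *   sum_{i,j} 2^(i+j) * popcount(a_plane[i] & b_plane[j]).
 * (Illustrative sketch; not the UPMEM DPU implementation.) */
static uint32_t dot_int4_bitserial(const uint64_t a_plane[4],
                                   const uint64_t b_plane[4])
{
    uint32_t acc = 0;
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            acc += (uint32_t)__builtin_popcountll(a_plane[i] & b_plane[j])
                   << (i + j);
    return acc;
}

int main(void)
{
    /* 64 elements of value 3 (bits 0,1 set) dotted with 64 elements of
     * value 5 (bits 0,2 set); expected result: 64 * 3 * 5 = 960. */
    uint64_t a[4] = { ~0ULL, ~0ULL, 0, 0 };
    uint64_t b[4] = { ~0ULL, 0, ~0ULL, 0 };
    printf("%u\n", dot_int4_bitserial(a, b));
    return 0;
}
```

The appeal of this formulation on hardware without wide multipliers is that it replaces per-element multiplications with bitwise AND and population-count operations, whose cost grows with the operand bit width rather than with a fixed word size.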