Training machine learning (ML) algorithms is a computationally intensive process, which is frequently memory-bound due to repeatedly accessing large training datasets. As a result, processor-centric systems (e.g., CPU, GPU) suffer from costly data movement between memory units and processing units, which consumes large amounts of energy and execution cycles. Memory-centric computing systems, i.e., with processing-in-memory (PIM) capabilities, can alleviate this data movement bottleneck. Our goal is to understand the potential of modern general-purpose PIM architectures to accelerate ML training. To do so, we (1) implement several representative classic ML algorithms (namely, linear regression, logistic regression, decision tree, K-Means clustering) on a real-world general-purpose PIM architecture, (2) rigorously evaluate and characterize them in terms of accuracy, performance and scaling, and (3) compare to their counterpart implementations on CPU and GPU. Our evaluation on a real memory-centric computing system with more than 2500 PIM cores shows that general-purpose PIM architectures can greatly accelerate memory-bound ML workloads, when the necessary operations and datatypes are natively supported by PIM hardware. For example, our PIM implementation of decision tree is $27\times$ faster than a state-of-the-art CPU version on an 8-core Intel Xeon, and $1.34\times$ faster than a state-of-the-art GPU version on an NVIDIA A100. Our K-Means clustering on PIM is $2.8\times$ and $3.2\times$ than state-of-the-art CPU and GPU versions, respectively. To our knowledge, our work is the first one to evaluate ML training on a real-world PIM architecture. We conclude with key observations, takeaways, and recommendations that can inspire users of ML workloads, programmers of PIM architectures, and hardware designers & architects of future memory-centric computing systems.
翻译:培训机器(ML)算法是一个计算密集的过程,由于多次访问大型培训数据集,通常会以记忆为主,因此,由于多次访问大型培训数据集,这种过程往往会以记忆为主。因此,在存储单位和处理单位之间,存储单位和处理单位之间的数据移动费用昂贵,消耗大量能源和执行周期。内存中心计算系统,即处理中存储(PIM)能力,可以缓解这一数据移动瓶颈。我们的目标是了解现代通用PIM结构的潜力,以加速ML培训。为此,我们(1)在实时通用的存储单位和处理单位之间,实施若干典型的ML(即CPU、GPPU、GPPP)计算系统(即PIM),对实时的存储中心计算系统的评价可以超过2500个PIM核心核心。对于通用的PIM结构,对于用于存储中存储的MIM值、物流回归、决定树组(即实时的运行)的运行速度超过IM。