Training machine learning algorithms is a computationally intensive process, which is frequently memory-bound due to repeatedly accessing large training datasets. As a result, processor-centric systems (e.g., CPU, GPU) suffer from costly data movement between memory units and processing units, which consumes large amounts of energy and execution cycles. Memory-centric computing systems, i.e., computing systems with processing-in-memory (PIM) capabilities, can alleviate this data movement bottleneck. Our goal is to understand the potential of modern general-purpose PIM architectures to accelerate machine learning training. To do so, we (1) implement several representative classic machine learning algorithms (namely, linear regression, logistic regression, decision tree, and K-means clustering) on a real-world general-purpose PIM architecture, (2) characterize them in terms of accuracy, performance, and scaling, and (3) compare them to their counterpart implementations on CPU and GPU. Our experimental evaluation on a memory-centric computing system with more than 2500 PIM cores shows that general-purpose PIM architectures can greatly accelerate memory-bound machine learning workloads when the necessary operations and datatypes are natively supported by PIM hardware. To our knowledge, our work is the first to evaluate training of machine learning algorithms on a real-world general-purpose PIM architecture.
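To illustrate why such training is memory-bound, the following is a minimal sketch (not the paper's PIM implementation) of batch gradient descent for linear regression in plain C: every epoch streams the entire training set through the processor, so data movement dominates. All sizes and the learning rate here are hypothetical, and the partitioning of the training set across PIM cores is not shown.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical problem sizes; a real PIM deployment would partition
 * the training set across thousands of PIM cores. */
#define N_SAMPLES  1024
#define N_FEATURES 16
#define N_EPOCHS   100
#define LR         0.01f

/* One epoch of batch gradient descent for linear regression.
 * Each epoch re-reads the full training set (x, y), which is why the
 * workload is memory-bound on processor-centric systems. */
static void train_epoch(const float x[N_SAMPLES][N_FEATURES],
                        const float y[N_SAMPLES], float w[N_FEATURES]) {
    float grad[N_FEATURES] = {0};
    for (int i = 0; i < N_SAMPLES; i++) {
        float pred = 0.0f;
        for (int j = 0; j < N_FEATURES; j++)
            pred += w[j] * x[i][j];
        float err = pred - y[i];
        for (int j = 0; j < N_FEATURES; j++)
            grad[j] += err * x[i][j];
    }
    /* Average the gradient and take one step. */
    for (int j = 0; j < N_FEATURES; j++)
        w[j] -= LR * grad[j] / N_SAMPLES;
}

int main(void) {
    static float x[N_SAMPLES][N_FEATURES];
    static float y[N_SAMPLES];
    float w[N_FEATURES] = {0};

    /* Synthetic data: y is the sum of the features, so the learned
     * weights should converge toward 1.0. */
    for (int i = 0; i < N_SAMPLES; i++) {
        y[i] = 0.0f;
        for (int j = 0; j < N_FEATURES; j++) {
            x[i][j] = (float)rand() / (float)RAND_MAX;
            y[i] += x[i][j];
        }
    }

    for (int e = 0; e < N_EPOCHS; e++)
        train_epoch(x, y, w);

    printf("w[0] after training: %f\n", w[0]);
    return 0;
}
```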