Computation in several real-world applications, such as probabilistic machine learning, sparse linear algebra, and robotic navigation, can be modeled as irregular directed acyclic graphs (DAGs). The irregular data dependencies in DAGs pose challenges to parallel execution on general-purpose CPUs and GPUs, resulting in severe under-utilization of the hardware. This paper proposes DPU, a specialized processor designed for the efficient execution of irregular DAGs. The DPU is equipped with parallel compute units that execute different subgraphs of a DAG independently. The compute units can synchronize within a cycle using a hardware-supported synchronization primitive, and they communicate via an efficient interconnect to a global banked scratchpad. Furthermore, a precision-scalable posit arithmetic unit is developed to enable application-dependent precision. The DPU is taped out in 28 nm CMOS, achieving speedups of 5.1$\times$ and 20.6$\times$ over state-of-the-art CPU and GPU implementations, respectively, on DAGs of sparse linear algebra and probabilistic machine learning workloads. This performance is achieved while operating at a power budget of 0.23 W, as opposed to 55 W and 98 W for the CPU and GPU, resulting in a peak efficiency of 538 GOPS/W with DPU, which is 1350$\times$ and 9000$\times$ higher than the CPU and GPU, respectively. Thus, with a specialized architecture, DPU enables low-power execution of irregular DAG workloads.
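To make the under-utilization claim concrete, the following minimal Python sketch groups the nodes of a small irregular DAG into dependency levels and measures how many compute-unit slots do useful work when a barrier separates consecutive levels. It is only an illustration under stated assumptions: the level-synchronous schedule, the function names `topo_levels` and `utilization`, and the example graph are hypothetical and are not taken from the DPU design, which executes subgraphs on independent compute units.

```python
from collections import defaultdict


def topo_levels(num_nodes, edges):
    """Group DAG nodes into levels; nodes within a level do not depend on each other."""
    succs = defaultdict(list)
    indeg = [0] * num_nodes
    for src, dst in edges:
        succs[src].append(dst)
        indeg[dst] += 1

    # Start from nodes with no incoming edges (the DAG inputs).
    frontier = [n for n in range(num_nodes) if indeg[n] == 0]
    levels = []
    while frontier:
        levels.append(frontier)
        next_frontier = []
        for node in frontier:
            for succ in succs[node]:
                indeg[succ] -= 1
                if indeg[succ] == 0:
                    next_frontier.append(succ)
        frontier = next_frontier

    if sum(len(level) for level in levels) != num_nodes:
        raise ValueError("graph contains a cycle")
    return levels


def utilization(levels, num_units):
    """Fraction of unit-slots doing useful work with a barrier after every level."""
    # Each level needs ceil(len(level) / num_units) time steps.
    steps = sum(-(-len(level) // num_units) for level in levels)
    work = sum(len(level) for level in levels)
    return work / (steps * num_units)


# Hypothetical 7-node irregular DAG, scheduled on 4 hypothetical compute units.
edges = [(0, 2), (1, 2), (2, 3), (2, 4), (2, 5), (5, 6)]
levels = topo_levels(7, edges)
print(levels)                  # [[0, 1], [2], [3, 4, 5], [6]]
print(utilization(levels, 4))  # 0.4375
```

In this toy graph the narrow levels leave more than half of the unit-slots idle, which is the kind of loss that irregular dependencies cause on wide parallel hardware.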