Computationally intensive algorithms such as Deep Neural Networks (DNNs) are becoming killer applications for edge devices. Porting heavily data-parallel algorithms to resource-constrained, battery-powered devices poses several challenges related to memory footprint, computational throughput, and energy efficiency. Low-bitwidth and mixed-precision arithmetic have proven to be valid strategies for tackling these problems. We present Dustin, a fully programmable compute cluster integrating 16 RISC-V cores capable of 2- to 32-bit arithmetic and all possible mixed-precision permutations. In addition to a conventional Multiple-Instruction Multiple-Data (MIMD) processing paradigm, Dustin introduces a Vector Lockstep Execution Mode (VLEM) to minimize power consumption in highly data-parallel kernels. In VLEM, a single leader core fetches instructions and broadcasts them to the 15 follower cores. Clock gating the Instruction Fetch (IF) stages and private caches of the follower cores yields a 38% power reduction with minimal performance overhead (<3%). The cluster, implemented in 65 nm CMOS technology, achieves a peak performance of 58 GOPS and a peak efficiency of 1.15 TOPS/W.
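To make the mixed-precision idea concrete, the following is a minimal software sketch of a dot product between 8-bit activations and 2-bit weights, each packed into 32-bit words. It only illustrates why narrow operands shrink the memory footprint and how lanes of different widths can be paired; it is not Dustin's instruction set or datapath. The helper names (`pack`, `unpack_lane`, `mixed_dot`), the lane ordering, and the two's-complement encoding are assumptions made for this example.

```python
# Illustrative sketch only: models packed mixed-precision operands in software.
# Function names, lane order (least-significant lane first), and the signed
# two's-complement encoding are assumptions, not taken from the Dustin design.

def pack(values, bits):
    """Pack signed integers into 32-bit words, 32 // bits lanes per word."""
    lanes = 32 // bits
    mask = (1 << bits) - 1
    words = []
    for i in range(0, len(values), lanes):
        word = 0
        for lane, v in enumerate(values[i:i + lanes]):
            word |= (v & mask) << (lane * bits)  # keep only the low `bits` bits
        words.append(word)
    return words


def unpack_lane(word, lane, bits):
    """Extract one lane from a packed word and sign-extend it."""
    raw = (word >> (lane * bits)) & ((1 << bits) - 1)
    if raw & (1 << (bits - 1)):  # sign bit set: value is negative
        raw -= 1 << bits
    return raw


def mixed_dot(act_words, wgt_words, n, act_bits=8, wgt_bits=2):
    """Dot product of n elements with different operand widths."""
    a_lanes = 32 // act_bits
    w_lanes = 32 // wgt_bits
    acc = 0  # wide accumulator, as in a multiply-accumulate loop
    for i in range(n):
        a = unpack_lane(act_words[i // a_lanes], i % a_lanes, act_bits)
        w = unpack_lane(wgt_words[i // w_lanes], i % w_lanes, wgt_bits)
        acc += a * w
    return acc


activations = [10, -3, 7, 100, -128, 5, 0, 1]  # signed 8-bit range
weights = [1, -2, 0, -1, 1, 1, -2, 0]          # signed 2-bit range: -2..1

act_words = pack(activations, 8)  # 4 lanes per word
wgt_words = pack(weights, 2)      # 16 lanes per word
result = mixed_dot(act_words, wgt_words, len(activations))
```

In this sketch the eight 2-bit weights occupy part of a single 32-bit word, whereas the same weights stored at 8 bits would need two words. Hardware with native mixed-precision support would perform the lane extraction and multiply-accumulate in the datapath rather than in a scalar loop as done here.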