On-chip DNN inference and training at the extreme edge (TinyML) impose strict latency, throughput, accuracy, and flexibility requirements. Heterogeneous clusters are a promising solution to this challenge, combining the flexibility of DSP-enhanced cores with the performance and energy benefits of dedicated accelerators. We present DARKSIDE, a System-on-Chip with a heterogeneous cluster of 8 RISC-V cores enhanced with 2-b to 32-b mixed-precision integer arithmetic. To boost performance and efficiency on key compute-intensive Deep Neural Network (DNN) kernels, the cluster is enriched with three digital accelerators: a specialized engine for low-data-reuse depthwise convolution kernels (up to 30 MAC/cycle); a minimal-overhead datamover to marshal 1-b to 32-b data on the fly; and a 16-b floating-point Tensor Product Engine (TPE) for tiled matrix-multiplication acceleration. DARKSIDE is implemented in 65 nm CMOS technology. The cluster achieves a peak integer performance of 65 GOPS and a peak efficiency of 835 GOPS/W on 2-b integer DNN kernels. On floating-point tensor operations, the TPE delivers up to 18.2 GFLOPS of performance or 300 GFLOPS/W of efficiency: enough to enable on-chip floating-point training at competitive speed, coupled with ultra-low-power quantized inference.