On-chip DNN inference and training at the extreme edge (TinyML) impose strict latency, throughput, accuracy, and flexibility requirements. Heterogeneous clusters are a promising solution to this challenge, combining the flexibility of DSP-enhanced cores with the performance and energy benefits of dedicated accelerators. We present DARKSIDE, a System-on-Chip with a heterogeneous cluster of 8 RISC-V cores enhanced with 2-b to 32-b mixed-precision integer arithmetic. To boost performance and efficiency on key compute-intensive Deep Neural Network (DNN) kernels, the cluster is enriched with three digital accelerators: a specialized engine for low-data-reuse depthwise convolution kernels (up to 30 MAC/cycle); a minimal-overhead datamover to marshal 1-b to 32-b data on the fly; and a 16-b floating-point Tensor Product Engine (TPE) for tiled matrix-multiplication acceleration. DARKSIDE is implemented in 65 nm CMOS technology. The cluster achieves a peak integer performance of 65 GOPS and a peak efficiency of 835 GOPS/W on 2-b integer DNN kernels. On floating-point tensor operations, the TPE delivers up to 18.2 GFLOPS of performance or 300 GFLOPS/W of efficiency: enough to enable on-chip floating-point training at competitive speed, coupled with ultra-low-power quantized inference.