Reduced Precision Floating-Point Optimization for Deep Neural Network On-Device Learning on MicroControllers - 专知论文

会员服务 ·

0

可约的 · 查准率/准确率 · Networking · Learning · Neural Networks ·

2023 年 5 月 30 日

Reduced Precision Floating-Point Optimization for Deep Neural Network On-Device Learning on MicroControllers

翻译：暂无翻译

Davide Nadalini,Manuele Rusci,Luca Benini,Francesco Conti

from arxiv, Pre-print version submitted to Elsevier's Future Generation Computer Systems journal. For the associated open-source release, see https://github.com/pulp-platform/pulp-trainlib

Enabling On-Device Learning (ODL) for Ultra-Low-Power Micro-Controller Units (MCUs) is a key step for post-deployment adaptation and fine-tuning of Deep Neural Network (DNN) models in future TinyML applications. This paper tackles this challenge by introducing a novel reduced precision optimization technique for ODL primitives on MCU-class devices, leveraging the State-of-Art advancements in RISC-V RV32 architectures with support for vectorized 16-bit floating-point (FP16) Single-Instruction Multiple-Data (SIMD) operations. Our approach for the Forward and Backward steps of the Back-Propagation training algorithm is composed of specialized shape transform operators and Matrix Multiplication (MM) kernels, accelerated with parallelization and loop unrolling. When evaluated on a single training step of a 2D Convolution layer, the SIMD-optimized FP16 primitives result up to 1.72$\times$ faster than the FP32 baseline on a RISC-V-based 8+1-core MCU. An average computing efficiency of 3.11 Multiply and Accumulate operations per clock cycle (MAC/clk) and 0.81 MAC/clk is measured for the end-to-end training tasks of a ResNet8 and a DS-CNN for Image Classification and Keyword Spotting, respectively -- requiring 17.1 ms and 6.4 ms on the target platform to compute a training step on a single sample. Overall, our approach results more than two orders of magnitude faster than existing ODL software frameworks for single-core MCUs and outperforms by 1.6 $\times$ previous FP32 parallel implementations on a Continual Learning setup.

翻译：暂无翻译

0

相关内容

可约的

高效可扩展图神经网络的研究进展，Recent Advances in Efficient and Scalable Graph Neural Networks

高效可扩展图神经网络的研究进展，Recent Advances in Efficient and Scalable Graph Neural Networks

专知会员服务

78+阅读 · 2022年3月15日

图像分类技巧集，17页ppt《Bag of Tricks for Image Classification》

图像分类技巧集，17页ppt《Bag of Tricks for Image Classification》

专知会员服务

96+阅读 · 2020年3月12日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日

Connections between Support Vector Machines, Wasserstein distance and gradient-penalty GANs

Connections between Support Vector Machines, Wasserstein distance and gradient-penalty GANs

专知会员服务

36+阅读 · 2019年10月17日

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

专知会员服务

59+阅读 · 2019年10月17日

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

深度自进化聚类：Deep Self-Evolution Clustering

深度自进化聚类：Deep Self-Evolution Clustering

我爱读PAMI

15+阅读 · 2019年4月13日

强化学习的Unsupervised Meta-Learning

强化学习的Unsupervised Meta-Learning

CreateAMind

18+阅读 · 2019年1月7日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

43+阅读 · 2019年1月3日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

18+阅读 · 2018年12月24日

罗巴代数的表示和罗巴代数在operad中的应用

国家自然科学基金

0+阅读 · 2015年12月31日

双层仿生抗凝血血管支架及体内血管再生的研究

国家自然科学基金

0+阅读 · 2014年12月31日

C1型尼曼-匹克氏症轴突发育异常的病理机制

国家自然科学基金

0+阅读 · 2013年12月31日

PMMA-YAG:Ce3+纳米有机复合荧光发光材料的可控制备与光增强研究

国家自然科学基金

0+阅读 · 2012年12月31日

纳米复合结构的稀土掺杂透明玻璃陶瓷中表面等离子体共振增强的量子剪裁发光

国家自然科学基金

0+阅读 · 2012年12月31日

High-Speed Area-Efficient Hardware Architecture for the Efficient Detection of Faults in a Bit-Parallel Multiplier Utilizing the Polynomial Basis of GF(2m)

Arxiv

0+阅读 · 2023年7月18日

Scaling Package Queries to a Billion Tuples via Hierarchical Partitioning and Customized Optimization

Arxiv

0+阅读 · 2023年7月18日

Implementation of a perception system for autonomous vehicles using a detection-segmentation network in SoC FPGA

Implementation of a perception system for autonomous vehicles using a detection-segmentation network in SoC FPGA

Arxiv

0+阅读 · 2023年7月17日

Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks

Arxiv

14+阅读 · 2021年1月31日

Contrastive Clustering

Arxiv

31+阅读 · 2020年9月21日

VIP会员

文章信息

相关主题

查准率/准确率

Neural Networks

相关VIP内容

高效可扩展图神经网络的研究进展，Recent Advances in Efficient and Scalable Graph Neural Networks

高效可扩展图神经网络的研究进展，Recent Advances in Efficient and Scalable Graph Neural Networks

专知会员服务

78+阅读 · 2022年3月15日

图像分类技巧集，17页ppt《Bag of Tricks for Image Classification》

图像分类技巧集，17页ppt《Bag of Tricks for Image Classification》

专知会员服务

96+阅读 · 2020年3月12日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日

Connections between Support Vector Machines, Wasserstein distance and gradient-penalty GANs

Connections between Support Vector Machines, Wasserstein distance and gradient-penalty GANs

专知会员服务

36+阅读 · 2019年10月17日

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

专知会员服务

59+阅读 · 2019年10月17日

热门VIP内容

开通专知VIP会员享更多权益服务

前沿人工智能趋势报告（Frontier AI Trends Report）

【AAAI2026】善始则事半功倍：基于前缀优化的大语言模型推理强化学习

Andrej Karpathy：2025 年 LLM 年度回顾（2025 LLM Year in Review）

音退化问题：基于输入操控的鲁棒语音转换综述

相关资讯

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

深度自进化聚类：Deep Self-Evolution Clustering

深度自进化聚类：Deep Self-Evolution Clustering

我爱读PAMI

15+阅读 · 2019年4月13日

强化学习的Unsupervised Meta-Learning

强化学习的Unsupervised Meta-Learning

CreateAMind

18+阅读 · 2019年1月7日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

43+阅读 · 2019年1月3日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

18+阅读 · 2018年12月24日

相关论文

High-Speed Area-Efficient Hardware Architecture for the Efficient Detection of Faults in a Bit-Parallel Multiplier Utilizing the Polynomial Basis of GF(2m)

Arxiv

0+阅读 · 2023年7月18日

Scaling Package Queries to a Billion Tuples via Hierarchical Partitioning and Customized Optimization

Arxiv

0+阅读 · 2023年7月18日

Implementation of a perception system for autonomous vehicles using a detection-segmentation network in SoC FPGA

Implementation of a perception system for autonomous vehicles using a detection-segmentation network in SoC FPGA

Arxiv

0+阅读 · 2023年7月17日

Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks

Arxiv

14+阅读 · 2021年1月31日

Contrastive Clustering

Arxiv

31+阅读 · 2020年9月21日

相关基金

罗巴代数的表示和罗巴代数在operad中的应用

国家自然科学基金

0+阅读 · 2015年12月31日

双层仿生抗凝血血管支架及体内血管再生的研究

国家自然科学基金

0+阅读 · 2014年12月31日

C1型尼曼-匹克氏症轴突发育异常的病理机制

国家自然科学基金

0+阅读 · 2013年12月31日

PMMA-YAG:Ce3+纳米有机复合荧光发光材料的可控制备与光增强研究

国家自然科学基金

0+阅读 · 2012年12月31日

纳米复合结构的稀土掺杂透明玻璃陶瓷中表面等离子体共振增强的量子剪裁发光

国家自然科学基金

0+阅读 · 2012年12月31日

微信扫码咨询专知VIP会员