The residual block is a common component in recent state-of-the-art CNNs such as EfficientNet and EfficientDet. Shortcut data accounts for nearly 40% of feature-map accesses in ResNet152 [8], yet most previous DNN compilers and accelerators ignore shortcut data optimization. This paper presents ShortcutFusion, an optimization tool for FPGA-based accelerators with reuse-aware static memory allocation for shortcut data, to maximize on-chip data reuse under given resource constraints. From TensorFlow DNN models, the proposed design generates instruction sets for groups of nodes, applying an optimized data-reuse scheme to each residual block. The accelerator, implemented on a Xilinx KCU1500 FPGA card, significantly outperforms the NVIDIA RTX 2080 Ti, Titan Xp, and GTX 1080 Ti for EfficientNet inference. Compared to the RTX 2080 Ti, the proposed design is 1.35-2.33x faster and 6.7-7.9x more power efficient. Compared to a baseline in which the weights, inputs, and outputs of each layer are accessed from off-chip memory exactly once, ShortcutFusion reduces DRAM accesses by 47.8-84.8% for RetinaNet, YOLOv3, ResNet152, and EfficientNet. Given a buffer size similar to that of ShortcutMining [8], which also mines shortcut data in hardware, the proposed work reduces off-chip feature-map accesses by 5.27x while reading weights from off-chip memory exactly once.
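To make the core idea concrete, the sketch below illustrates what a reuse-aware static allocation decision for shortcut data could look like. This is a minimal illustration under assumed semantics, not the paper's actual algorithm: the greedy per-block policy, the names (Tensor, ResidualBlock, plan_shortcut_reuse), and the size figures are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Tensor:
    name: str
    size_bytes: int      # feature-map footprint of the shortcut tensor

@dataclass
class ResidualBlock:
    shortcut: Tensor     # tensor reused at the block's elementwise-add node
    body_peak_bytes: int # peak on-chip buffer demand of the block's main path

def plan_shortcut_reuse(blocks, on_chip_budget):
    """Greedy reuse-aware allocation: pin a block's shortcut tensor
    on-chip only if it fits alongside the main path's peak demand;
    otherwise it is written to and re-read from DRAM once."""
    plan = {}
    for blk in blocks:
        fits = blk.shortcut.size_bytes + blk.body_peak_bytes <= on_chip_budget
        plan[blk.shortcut.name] = "on-chip" if fits else "DRAM"
    return plan

# Hypothetical example: with a 4 MiB budget, the small shortcut stays
# on-chip while the large one spills to DRAM.
blocks = [
    ResidualBlock(Tensor("res2a_shortcut", 512 * 1024), 1536 * 1024),
    ResidualBlock(Tensor("res5c_shortcut", 2048 * 1024), 3072 * 1024),
]
print(plan_shortcut_reuse(blocks, on_chip_budget=4 * 1024 * 1024))
# {'res2a_shortcut': 'on-chip', 'res5c_shortcut': 'DRAM'}
```

Because such a plan can be computed statically from the model graph, each residual block's data-reuse decision can be baked into the generated instruction set rather than resolved at run time.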