FPGA-based accelerators are becoming increasingly popular for deep neural networks (DNNs) due to their ability to scale performance with increasing degrees of specialization through dataflow architectures or custom data types. To reduce the barrier for software engineers and data scientists to adopt FPGAs, C++- and OpenCL-based design entry with high-level synthesis (HLS) has been introduced. It provides a higher level of abstraction than register-transfer level (RTL)-based design. HLS offers faster development time, better maintainability, and more flexibility in code exploration when evaluating options for multi-dimensional tensors, convolutional layers, or parallelism. Thus, HLS has been adopted by DNN accelerator generation frameworks such as FINN and hls4ml. In this paper, we present an alternative backend RTL library for FINN. We investigate and evaluate, across a spectrum of design dimensions, an RTL-based implementation versus the original HLS variant. We show that for smaller design parameters, RTL produces significantly smaller circuits. For larger circuits, however, the look-up table (LUT) count of the RTL-based design is slightly higher, by up to around $15\%$. On the other hand, HLS consistently requires more flip-flops (FFs) (an orders-of-magnitude increase) and block RAMs (BRAMs) ($2\times$ more). This also impacts the critical path delay, with RTL producing significantly faster circuits, by up to $80\%$. Furthermore, RTL benefits from at least a $10\times$ reduction in synthesis time. Finally, the results were validated in practice using a real-world use case: a multi-layer perceptron (MLP) network for network intrusion detection. Overall, since HLS frameworks code-generate the hardware design, the ease of design entry matters less than the reduction in synthesis time together with the resource benefits, which may make the RTL abstraction an attractive alternative.