The residual block is a very common component in recent state-of-the-art CNNs such as EfficientNet and EfficientDet. Shortcut data accounts for nearly 40% of the feature-map accesses in ResNet152 [8], yet most previous DNN compilers and accelerators ignore shortcut data optimization. This paper presents ShortcutFusion, an optimization tool for FPGA-based accelerators with reuse-aware static memory allocation for shortcut data, to maximize on-chip data reuse given resource constraints. From a TensorFlow DNN model, the proposed design generates instruction sets for a group of nodes, using an optimized data-reuse scheme for each residual block. The accelerator design implemented on a Xilinx KCU1500 FPGA card is 2.8x faster and 9.9x more power efficient than an NVIDIA RTX 2080 Ti for a 256x256 input size. Compared to a baseline in which the weights, inputs, and outputs are accessed from off-chip memory exactly once per layer, ShortcutFusion reduces DRAM access by 47.8-84.8% for RetinaNet, YOLOv3, ResNet152, and EfficientNet. Given a buffer size similar to ShortcutMining [8], which also mines shortcut data in hardware, the proposed work reduces off-chip feature-map accesses by 5.27x while accessing weights from off-chip memory exactly once.
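The core idea, reuse-aware static allocation of shortcut data, can be illustrated with a minimal Python sketch. This is not the paper's actual tool: the names (ResidualBlock, plan_shortcut_reuse) and the fit-or-spill policy are illustrative assumptions, and the sketch assumes shortcut lifetimes do not overlap, as in a plain residual chain. At compile time, each shortcut feature map that fits in the on-chip buffer budget is pinned on-chip, eliminating its DRAM write/read pair; otherwise it falls back to off-chip memory.

from dataclasses import dataclass

@dataclass
class ResidualBlock:
    name: str
    shortcut_bytes: int  # size of the block's shortcut feature map

def plan_shortcut_reuse(blocks, shortcut_buffer_bytes):
    """Statically decide, per residual block, whether its shortcut
    feature map stays in the on-chip buffer or spills to DRAM.
    Assumes shortcut lifetimes do not overlap (a plain residual
    chain), so each shortcut only has to fit by itself."""
    return {
        blk.name: ("on_chip" if blk.shortcut_bytes <= shortcut_buffer_bytes
                   else "dram")
        for blk in blocks
    }

if __name__ == "__main__":
    # Hypothetical block sizes; a real compiler would read them
    # from the TensorFlow graph.
    blocks = [
        ResidualBlock("res2a", 512 * 1024),
        ResidualBlock("res3a", 1024 * 1024),
        ResidualBlock("res4a", 256 * 1024),
    ]
    print(plan_shortcut_reuse(blocks, shortcut_buffer_bytes=768 * 1024))

Because the decision is made per block against a fixed buffer budget, the plan is fully static: no runtime bookkeeping is needed, which is what allows the instruction sets for each residual block to encode the chosen reuse pattern ahead of time.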