The residual block is a common component in recent state-of-the-art CNNs such as EfficientNet and EfficientDet. Shortcut data accounts for nearly 40% of feature-map accesses in ResNet152 [8], yet most previous DNN compilers and accelerators ignore shortcut data optimization. This paper presents ShortcutFusion, an optimization tool for FPGA-based accelerators with reuse-aware static memory allocation for shortcut data, to maximize on-chip data reuse under given resource constraints. From TensorFlow DNN models, the proposed design generates instruction sets for groups of nodes, applying an optimized data-reuse scheme to each residual block. The accelerator, implemented on a Xilinx KCU1500 FPGA card, significantly outperforms the NVIDIA RTX 2080 Ti, Titan Xp, and GTX 1080 Ti for EfficientNet inference. Compared to the RTX 2080 Ti, the proposed design is 1.35-2.33x faster and 6.7-7.9x more power efficient. Compared to a baseline in which the weights, inputs, and outputs of each layer are accessed from off-chip memory exactly once, ShortcutFusion reduces DRAM accesses by 47.8-84.8% for RetinaNet, YOLOv3, ResNet152, and EfficientNet. Given a buffer size similar to that of ShortcutMining [8], which also mines shortcut data in hardware, the proposed work reduces off-chip feature-map accesses by 5.27x while reading weights from off-chip memory exactly once.
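To make the core idea concrete, the sketch below illustrates what a reuse-aware static allocation decision for shortcut data could look like. This is a minimal illustration under assumed semantics, not the paper's actual algorithm: the greedy per-block policy, the names (Tensor, ResidualBlock, plan_shortcut_reuse), and the size figures are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Tensor:
    name: str
    size_bytes: int      # feature-map footprint of the shortcut tensor

@dataclass
class ResidualBlock:
    shortcut: Tensor     # tensor reused at the block's elementwise-add node
    body_peak_bytes: int # peak on-chip buffer demand of the block's main path

def plan_shortcut_reuse(blocks, on_chip_budget):
    """Greedy reuse-aware allocation: pin a block's shortcut tensor
    on-chip only if it fits alongside the main path's peak demand;
    otherwise it is written to and re-read from DRAM once."""
    plan = {}
    for blk in blocks:
        fits = blk.shortcut.size_bytes + blk.body_peak_bytes <= on_chip_budget
        plan[blk.shortcut.name] = "on-chip" if fits else "DRAM"
    return plan

# Hypothetical example: with a 4 MiB budget, the small shortcut stays
# on-chip while the large one spills to DRAM.
blocks = [
    ResidualBlock(Tensor("res2a_shortcut", 512 * 1024), 1536 * 1024),
    ResidualBlock(Tensor("res5c_shortcut", 2048 * 1024), 3072 * 1024),
]
print(plan_shortcut_reuse(blocks, on_chip_budget=4 * 1024 * 1024))
# {'res2a_shortcut': 'on-chip', 'res5c_shortcut': 'DRAM'}
```

Because such a plan can be computed statically from the model graph, each residual block's data-reuse decision can be baked into the generated instruction set rather than resolved at run time.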