While vision transformers (ViTs) have continuously achieved new milestones in the field of computer vision, their sophisticated network architectures with high computation and memory costs have impeded their deployment on resource-limited edge devices. In this paper, we propose a hardware-efficient, image-adaptive token pruning framework called HeatViT for efficient yet accurate ViT acceleration on embedded FPGAs. By analyzing the inherent computational patterns in ViTs, we first design an effective attention-based multi-head token selector, which can be progressively inserted before transformer blocks to dynamically identify and consolidate non-informative tokens from input images. Moreover, we implement the token selector in hardware by adding miniature control logic that heavily reuses the existing hardware components built for the backbone ViT. To improve hardware efficiency, we further employ 8-bit fixed-point quantization and propose polynomial approximations, with a regularization effect on quantization error, for the nonlinear functions frequently used in ViTs. Finally, we propose a latency-aware multi-stage training strategy to determine which transformer blocks receive token selectors and to optimize the desired (average) pruning rate of each inserted selector, improving both model accuracy and inference latency on hardware. Compared to existing ViT pruning studies, at a similar computation cost, HeatViT achieves 0.7%$\sim$8.9% higher accuracy; at a similar model accuracy, HeatViT achieves a 28.4%$\sim$65.3% computation reduction, for various widely used ViTs, including DeiT-T, DeiT-S, DeiT-B, LV-ViT-S, and LV-ViT-M, on the ImageNet dataset. Compared to the baseline hardware accelerator, our implementations of HeatViT on the Xilinx ZCU102 FPGA achieve a 3.46$\times$$\sim$4.89$\times$ speedup.
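To make the token selector concrete, the following is a minimal PyTorch sketch, not the paper's implementation: a tiny per-head MLP scores every token, the top-scoring tokens are kept, and the remaining ones are consolidated into a single package token instead of being discarded outright. The module name `MultiHeadTokenSelector`, the per-head MLP shape, and the `keep_ratio` parameter are illustrative assumptions.

```python
# Minimal sketch (illustrative only) of an attention-style multi-head token selector:
# per-head MLPs score each token, informative tokens are kept, and the pruned tokens
# are consolidated into one "package" token.
import torch
import torch.nn as nn


class MultiHeadTokenSelector(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4, keep_ratio: float = 0.7):
        super().__init__()
        self.num_heads = num_heads
        self.keep_ratio = keep_ratio
        head_dim = dim // num_heads
        # One tiny two-layer MLP per head predicting a keep/prune logit pair.
        self.scorers = nn.ModuleList(
            nn.Sequential(nn.Linear(head_dim, head_dim // 2),
                          nn.GELU(),
                          nn.Linear(head_dim // 2, 2))
            for _ in range(num_heads)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1 + N, dim); token 0 is the class token and is always kept.
        b, n, d = x.shape
        cls_tok, patches = x[:, :1], x[:, 1:]
        chunks = patches.chunk(self.num_heads, dim=-1)
        # Average the per-head keep probabilities into one score per token.
        probs = torch.stack(
            [torch.softmax(s(c), dim=-1)[..., 0] for s, c in zip(self.scorers, chunks)],
            dim=0).mean(dim=0)                          # (batch, N)
        num_keep = max(1, int(self.keep_ratio * (n - 1)))
        keep_idx = probs.topk(num_keep, dim=1).indices  # (batch, num_keep)
        gather_idx = keep_idx.unsqueeze(-1).expand(-1, -1, d)
        kept = patches.gather(1, gather_idx)
        # Consolidate the pruned tokens into one package token, weighted by score.
        mask = torch.ones_like(probs).scatter(1, keep_idx, 0.0)
        w = (probs * mask).unsqueeze(-1)
        package = (patches * w).sum(1, keepdim=True) / (w.sum(1, keepdim=True) + 1e-6)
        return torch.cat([cls_tok, kept, package], dim=1)


# Usage: shrink 196 patch tokens to 137 kept tokens plus one package token.
selector = MultiHeadTokenSelector(dim=384, num_heads=4, keep_ratio=0.7)
tokens = torch.randn(2, 197, 384)        # e.g. a DeiT-S token sequence
print(selector(tokens).shape)            # torch.Size([2, 139, 384])
```

Because the selector is built from linear layers, softmax, and gather/reduce operations already present in the ViT backbone, a hardware implementation can reuse the existing compute engines and only needs additional control logic, which is the reuse argument made above.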
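The polynomial approximation of nonlinear functions can likewise be sketched as a low-degree least-squares fit over a clipped activation range, which is the kind of substitution that maps well onto 8-bit fixed-point hardware. The degree, fitting range, and use of the tanh-based GELU as the reference are assumptions for illustration, not the paper's exact construction or coefficients.

```python
# Illustrative sketch: approximate GELU with a low-degree polynomial on a clipped
# range, as one would do before mapping the activation onto fixed-point hardware.
import numpy as np


def gelu(x):
    # Reference GELU (tanh formulation).
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


# Degree and clipping range [-4, 4] are illustrative assumptions.
xs = np.linspace(-4.0, 4.0, 2048)
coeffs = np.polyfit(xs, gelu(xs), deg=4)
poly_gelu = np.poly1d(coeffs)

max_err = np.max(np.abs(poly_gelu(xs) - gelu(xs)))
print("polynomial coefficients:", np.round(coeffs, 4))
print(f"max abs error on [-4, 4]: {max_err:.4f}")
```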