Many state-of-the-art deep learning models for computer vision tasks are based on the transformer architecture. Such models can be computationally expensive and are typically statically configured for a given deployment scenario. However, in real-time applications, the resources available for each inference can vary considerably and may be smaller than what state-of-the-art models require. Dynamic models can adapt model execution to meet such real-time resource constraints. Prior work on dynamic models has focused mainly on CNNs and early transformer models such as BERT, and has primarily aimed to reduce resource utilization for less complex input images while maintaining accuracy. In contrast, we adapt vision transformers to meet dynamic system resource constraints, independent of the input image. We find that, unlike early transformer models, recent state-of-the-art vision transformers rely heavily on convolution layers. We show that pretrained models are fairly resilient to skipping computation in the convolution and self-attention layers, which enables a low-overhead system for dynamic real-time inference without additional training. Finally, we create an optimized accelerator for these dynamic vision transformers in a 5nm technology. The PE array occupies 2.26mm$^2$ and is 17 times faster than an NVIDIA TITAN V GPU for state-of-the-art transformer-based models for semantic segmentation.
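The abstract does not say how the system decides which computation to skip. As an illustration only, the following is a minimal Python sketch of budget-driven skipping in a pretrained model, assuming that each convolution or self-attention sub-layer sits on a residual path (so a skipped block reduces to the identity) and that each block has hypothetical, offline-estimated `cost` and `importance` values. The greedy policy, the `Block`, `select_blocks`, and `run` names, and all numbers are assumptions for illustration, not the authors' method.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Block:
    """A residual sub-layer (e.g. convolution or self-attention).

    Hypothetical fields: `cost` is an estimated per-inference cost
    (e.g. latency in ms); `importance` is an offline-measured accuracy
    impact of skipping the block (higher means keep it longer).
    """
    name: str
    cost: float
    importance: float
    fn: Callable[[float], float]


def select_blocks(blocks: Sequence[Block], budget: float) -> List[bool]:
    """Return a keep/skip mask whose total cost fits within `budget`.

    Assumed greedy policy: skip the least important blocks first
    until the remaining cost fits. No retraining is involved.
    """
    keep = [True] * len(blocks)
    total = sum(b.cost for b in blocks)
    # Visit blocks from least to most important.
    order = sorted(range(len(blocks)), key=lambda i: blocks[i].importance)
    for i in order:
        if total <= budget:
            break
        keep[i] = False
        total -= blocks[i].cost
    return keep


def run(blocks: Sequence[Block], keep: Sequence[bool], x: float) -> float:
    """Run the model, replacing skipped residual blocks with the identity."""
    for block, k in zip(blocks, keep):
        if k:
            x = x + block.fn(x)  # residual connection: x + f(x)
        # A skipped block passes x through unchanged via the residual path.
    return x


# Usage with toy scalar "layers" and made-up costs (total cost = 12.0).
blocks = [
    Block("conv1", 2.0, 0.9, lambda x: 0.5 * x),
    Block("attn1", 4.0, 0.3, lambda x: 1.0),
    Block("conv2", 2.0, 0.5, lambda x: -0.25 * x),
    Block("attn2", 4.0, 0.8, lambda x: 2.0),
]
mask = select_blocks(blocks, budget=8.0)
print(mask)                   # [True, False, True, True]
print(run(blocks, mask, 1.0))  # 3.125
```

Because the budget is an argument to each call, a new mask can be chosen per inference as available resources change, independent of the input image; the per-call overhead is a sort over a handful of blocks.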