Vision Transformer models, such as ViT, Swin Transformer, and Transformer-in-Transformer, have recently gained significant traction in computer vision tasks due to their ability to capture global relations among features, which leads to superior performance. However, they are compute-heavy and difficult to deploy on resource-constrained edge devices. Existing hardware accelerators, including those for the closely related BERT transformer models, do not target highly resource-constrained environments. In this paper, we address this gap and propose ViTA, a configurable hardware accelerator for inference of vision transformer models, targeting resource-constrained edge computing devices and avoiding repeated off-chip memory accesses. We employ a head-level pipeline and inter-layer MLP optimizations, and can support several commonly used vision transformer models with changes solely in our control logic. We achieve nearly 90% hardware utilization efficiency on most vision transformer models, report a power consumption of 0.88 W when synthesized at a clock frequency of 150 MHz, and obtain reasonable frame rates, all of which make ViTA suitable for edge applications.