Following its success in natural language processing, the transformer has attracted significant attention for vision applications in recent years due to its excellent performance. However, existing deep learning hardware accelerators for vision cannot execute this structure efficiently because of significant differences in model architecture. This paper therefore proposes a hardware accelerator for vision transformers with row-wise scheduling, which decomposes the major operations in vision transformers into a single dot-product primitive for unified and efficient execution. Furthermore, by sharing weights across columns, we can reuse data and reduce memory usage. The implementation in TSMC 40nm CMOS technology requires only a 262K gate count and a 149KB SRAM buffer for 403.2 GOPS throughput at a 600MHz clock frequency.
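As a rough functional illustration (not the paper's hardware design), the decomposition can be sketched in NumPy: both the attention score computation and the linear layers of a vision transformer reduce to repeated applications of one dot-product primitive, scheduled row by row so that each input row is swept across all weight columns. The function names here are illustrative, not from the paper.

```python
import numpy as np

def dot(a, b):
    # The single dot-product primitive: the one operation the
    # accelerator's datapath needs to support.
    return float(np.sum(a * b))

def matmul_rowwise(A, B):
    # Row-wise scheduling: each output row is produced by sweeping one
    # row of A against every column of B. Because all columns of B are
    # visited while the same input row is resident, that row's data is
    # reused across columns, reducing memory traffic.
    m, k = A.shape
    k2, n = B.shape
    assert k == k2
    out = np.zeros((m, n))
    for i in range(m):          # one output row at a time
        for j in range(n):      # weight columns share the resident input row
            out[i, j] = dot(A[i, :], B[:, j])
    return out

# Attention scores (Q @ K.T) and MLP linear layers both map onto the
# same primitive under this scheduling.
rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((4, 8))
scores = matmul_rowwise(Q, K.T)
assert np.allclose(scores, Q @ K.T)
```

This is only a software model of the dataflow; the actual accelerator realizes the primitive and its scheduling in fixed-function logic.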