Automated design of efficient transformer models has recently attracted significant attention from industry and academia. However, most works focus on only certain metrics while searching for the best-performing transformer architecture and ignore the rest. Furthermore, running traditional, complex, and large transformer models on low-compute edge platforms is a challenging problem. In this work, we propose a framework, called ProTran, to profile the hardware performance measures for a design space of transformer architectures on a diverse set of edge devices. We use this profiler in conjunction with the proposed co-design technique to obtain best-performing models that achieve high accuracy on the given task while minimizing latency, energy consumption, and peak power draw, thus enabling edge deployment. We refer to our framework for co-optimizing accuracy and hardware performance measures as EdgeTran; it searches for the best transformer model and edge device pair. Finally, we propose GPTran, a multi-stage block-level grow-and-prune post-processing step that further improves accuracy in a hardware-aware manner. The obtained transformer model is 2.8$\times$ smaller and has a 0.8% higher GLUE score than the baseline (BERT-Base). Running inference with it on the selected edge device yields 15.0% lower latency, 10.0$\times$ lower energy consumption, and 10.8$\times$ lower peak power draw compared to an off-the-shelf GPU.
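To make the co-design objective concrete, the sketch below shows one plausible way to scalarize accuracy against profiled hardware measures when ranking (model, device) pairs. It is a minimal illustration, not the paper's actual formulation: the `Profile` structure, `codesign_score` function, weights, normalization bounds, and all numbers are hypothetical assumptions.

```python
from dataclasses import dataclass

@dataclass
class Profile:
    """Hardware measurements for one (model, device) pair, as a
    ProTran-style profiler might report them (fields are assumed)."""
    accuracy: float      # task accuracy (e.g., GLUE score scaled to [0, 1])
    latency_ms: float    # inference latency in milliseconds
    energy_mj: float     # energy per inference in millijoules
    peak_power_w: float  # peak power draw in watts

def codesign_score(p: Profile,
                   w_acc: float = 1.0,
                   w_lat: float = 0.3,
                   w_en: float = 0.3,
                   w_pow: float = 0.3,
                   bounds: tuple = (500.0, 250.0, 10.0)) -> float:
    """Scalarized co-design objective: reward accuracy, penalize
    normalized latency, energy, and peak power. The weights and
    normalization bounds here are illustrative, not from the paper."""
    lat_b, en_b, pow_b = bounds
    return (w_acc * p.accuracy
            - w_lat * p.latency_ms / lat_b
            - w_en * p.energy_mj / en_b
            - w_pow * p.peak_power_w / pow_b)

# Pick the best (model, device) pair among profiled candidates
# (candidate names and measurements are made up for illustration).
candidates = {
    ("bert-base", "raspberry-pi-4"): Profile(0.79, 410.0, 220.0, 6.1),
    ("searched-small", "jetson-nano"): Profile(0.80, 95.0, 40.0, 4.8),
}
best = max(candidates, key=lambda pair: codesign_score(candidates[pair]))
print("best (model, device) pair:", best)
```

Under these assumed weights, a model with slightly higher accuracy and much lower latency, energy, and peak power dominates, which mirrors the trade-off the abstract describes.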