Automated co-design of machine learning models and evaluation hardware is critical for efficiently deploying such models at scale. Despite the state-of-the-art performance of transformer models, they are not yet ready for execution on resource-constrained hardware platforms. The high memory requirements and low parallelizability of the transformer architecture exacerbate this problem. Recently proposed accelerators attempt to optimize the throughput and energy consumption of transformer models. However, such works are limited either to a one-sided search over the model architecture or to a restricted set of off-the-shelf devices. Furthermore, previous works only accelerate model inference and not training, which requires substantially more memory and compute resources, making the problem even more challenging. To address these limitations, this work proposes a dynamic training framework, called DynaProp, that speeds up the training process and reduces memory consumption. DynaProp is a low-overhead pruning method that prunes activations and gradients at runtime. To effectively execute this method on hardware for a diverse set of transformer architectures, we propose ELECTOR, a framework that simulates transformer inference and training on a design space of accelerators. We use this simulator in conjunction with the proposed co-design technique, called TransCODE, to obtain the best-performing models that achieve high accuracy on the given task while minimizing latency, energy consumption, and chip area. The resulting transformer-accelerator pair achieves 0.3% higher accuracy than the state-of-the-art pair while incurring 5.2$\times$ lower latency and 3.0$\times$ lower energy consumption.
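To make the idea of runtime activation and gradient pruning concrete, the following PyTorch sketch zeroes out low-magnitude activations in the forward pass and low-magnitude gradients in the backward pass. It is a minimal illustration only: the `MagnitudePrune` helper and the fixed magnitude thresholds are hypothetical stand-ins for exposition, not the paper's actual DynaProp algorithm.

```python
import torch

class MagnitudePrune(torch.autograd.Function):
    """Illustrative runtime pruning (hypothetical sketch, not DynaProp itself):
    zero small-magnitude activations on the forward pass and small-magnitude
    gradients on the backward pass."""

    @staticmethod
    def forward(ctx, x, act_threshold, grad_threshold):
        ctx.grad_threshold = grad_threshold
        # Prune activations whose magnitude falls below the threshold.
        return torch.where(x.abs() < act_threshold, torch.zeros_like(x), x)

    @staticmethod
    def backward(ctx, grad_output):
        # Prune gradients the same way before propagating them further.
        pruned = torch.where(grad_output.abs() < ctx.grad_threshold,
                             torch.zeros_like(grad_output), grad_output)
        # One gradient per forward input; thresholds take no gradient.
        return pruned, None, None

# Usage: apply after a sub-layer's output during training.
x = torch.randn(8, 128, requires_grad=True)
y = MagnitudePrune.apply(x, 1e-2, 1e-2)
y.sum().backward()
print(f"nonzero grads: {x.grad.count_nonzero().item()} / {x.grad.numel()}")
```

Zeroed activations and gradients need not be stored or multiplied, which is the source of the memory and latency savings the abstract claims; how DynaProp selects what to prune is detailed in the paper itself.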