Transformers have achieved great success in a wide variety of natural language processing (NLP) tasks due to the attention mechanism, which assigns an importance score to every word relative to other words in a sequence. However, these models are very large, often reaching hundreds of billions of parameters, and therefore require a large number of DRAM accesses. Hence, traditional deep neural network (DNN) accelerators such as GPUs and TPUs face limitations in processing Transformers efficiently. In-memory accelerators based on non-volatile memory (NVM) promise to be an effective solution to this challenge, since they provide high storage density while performing massively parallel matrix vector multiplications within memory arrays. However, attention score computations, which are frequently used in Transformers (unlike in CNNs and RNNs), require matrix vector multiplications (MVMs) in which both operands change dynamically for each input. As a result, conventional NVM-based accelerators incur high write latency and write energy when used for Transformers, and further suffer from the low endurance of most NVM technologies. To address these challenges, we present X-Former, a hybrid in-memory hardware accelerator that consists of both NVM and CMOS processing elements to execute Transformer workloads efficiently. To improve the hardware utilization of X-Former, we also propose a sequence-blocking dataflow, which overlaps the computations of the two processing elements and reduces execution time. Across several benchmarks, we show that X-Former achieves up to 85x and 7.5x improvements in latency and energy over an NVIDIA GeForce GTX 1060 GPU, and up to 10.7x and 4.6x improvements in latency and energy over a state-of-the-art in-memory NVM accelerator.
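The distinction the abstract draws, between MVMs with static weights (crossbar-friendly) and attention-score MVMs whose operands both change per input, can be illustrated with a minimal NumPy sketch. This is purely illustrative and not the paper's implementation; all names and shapes here are assumptions.

```python
import numpy as np

def attention_scores(x, w_q, w_k):
    """Toy single-head attention-score computation (illustrative only).

    w_q and w_k are fixed weights: the products x @ w_q and x @ w_k are
    MVMs against static operands, so they map well to NVM crossbars that
    are programmed once. The score computation q @ k.T, by contrast,
    multiplies two activation matrices that are different for every
    input, which is what would force repeated NVM writes on a
    conventional in-memory accelerator.
    """
    q = x @ w_q       # static-weight MVM (crossbar-friendly)
    k = x @ w_k       # static-weight MVM (crossbar-friendly)
    scores = q @ k.T  # dynamic-operand MVM: both q and k change per input
    return scores / np.sqrt(w_q.shape[1])

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16))      # hypothetical sequence: 8 tokens, dim 16
w_q = rng.standard_normal((16, 16))
w_k = rng.standard_normal((16, 16))
print(attention_scores(x, w_q, w_k).shape)  # one score per token pair: (8, 8)
```

The score matrix is sequence-length by sequence-length, so its operands grow with the input rather than staying fixed like trained weights, which is the property that motivates X-Former's hybrid NVM/CMOS design.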