AI accelerators, customized for AI workloads, provide cost-effective and high-performance solutions for training and inference. Trainium, an AI accelerator recently developed by Amazon Web Services (AWS), offers an attractive option for LLM training and inference through its heterogeneous architecture. However, achieving high performance on Trainium is challenging because of its systolic-array architecture and its special requirements on data layout. In this paper, we design high-performance matrix multiplication (matmul), a critical compute kernel, for LLM inference on Trainium. We introduce a series of techniques customized to Trainium, based on kernel fusion and novel caching strategies, to reduce data movement across the software-managed memory hierarchy, maximize SRAM bandwidth, and avoid expensive matrix transposes. Evaluating on nine datasets and four recent LLMs, we show that our system largely outperforms the state-of-the-art matmul implemented by AWS on Trainium: at the matmul kernel level, it achieves an average 1.35x speedup (up to 2.22x), which translates to an average 1.66x speedup (up to 2.49x) for end-to-end LLM inference.