Dense matrix multiply (MM) serves as one of the most heavily used kernels in deep learning applications. To cope with the high computation demands of these applications, heterogeneous architectures that combine FPGA programmable fabric with dedicated ASIC accelerators have emerged as promising platforms. For example, the AMD/Xilinx Versal ACAP architecture combines general-purpose CPU cores and programmable logic with AI Engine (AIE) processors optimized for AI/ML workloads. With 400 AIEs, it provides up to 6.4 TFLOPs of peak performance for 32-bit floating-point data. However, machine learning models often contain both large and small MM operations. While large MM operations can be parallelized efficiently across many cores, small MM operations typically cannot. We observe that executing some small MM layers from the BERT natural language processing model on a large, monolithic MM accelerator on Versal ACAP achieves less than 5% of the theoretical peak performance. This raises a key question: how can we design accelerators that fully use the abundant computation resources under limited communication bandwidth for applications with multiple MM layers of diverse sizes? We identify the mismatch between the massive computation resources of a single monolithic accelerator and the many small MM layers in an application as the biggest system throughput bottleneck. To resolve this problem, we propose the CHARM framework, which composes multiple diverse MM accelerator architectures that work concurrently on different layers within one application. We deploy the CHARM framework for four different applications, BERT, ViT, NCF, and MLP, on the AMD Versal ACAP VCK190 evaluation board. Our experiments show inference throughput of 1.46 TFLOPs, 1.61 TFLOPs, 1.74 TFLOPs, and 2.94 TFLOPs for BERT, ViT, NCF, and MLP, respectively, corresponding to 5.40x, 32.51x, 1.00x, and 1.00x throughput gains over a single monolithic accelerator.
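As a rough sanity check on the quoted 6.4 TFLOPs figure (assuming each AIE sustains eight fp32 multiply-accumulates per cycle at a 1 GHz clock, per-core figures that are not stated in the abstract itself), the peak follows from:

\[
400 \;\text{AIEs} \times 8 \;\tfrac{\text{MACs}}{\text{cycle} \cdot \text{AIE}} \times 2 \;\tfrac{\text{FLOPs}}{\text{MAC}} \times 10^{9} \;\tfrac{\text{cycles}}{\text{s}} = 6.4 \times 10^{12} \;\tfrac{\text{FLOPs}}{\text{s}}.
\]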