The continued growth in FPGA processing power, coupled with high-bandwidth memory (HBM), makes systems like the Xilinx U280 credible platforms for the linear solvers that often dominate the run time of scientific and engineering applications. In this paper we present Callipepla, an accelerator for a preconditioned conjugate gradient (CG) linear solver. FPGA acceleration of CG faces three challenges: (1) how to support an arbitrary problem and terminate acceleration processing on the fly, (2) how to coordinate long-vector data flow among processing modules, and (3) how to save off-chip memory bandwidth while maintaining double-precision (FP64) accuracy. To tackle these three challenges, we present (1) a stream-centric instruction set for efficient streaming processing and control, (2) decentralized vector flow scheduling, which coordinates vector data flow among modules and further reduces off-chip memory accesses with a double-memory-channel design, and (3) a mixed-precision scheme that saves bandwidth yet still achieves solutions of effectively double-precision quality. We prototype the accelerator on a Xilinx U280 HBM FPGA. Our evaluation shows that, compared to XcgSolver, the Xilinx HPC product, Callipepla achieves a 3.94x speedup, 3.36x higher throughput, and 2.94x better energy efficiency. Compared to an NVIDIA A100 GPU, which has 4x the memory bandwidth of Callipepla, we still achieve 77% of its throughput with 3.34x better energy efficiency.
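To make the solver setting concrete, the following is a minimal software sketch of a preconditioned CG loop with an on-the-fly convergence check and one possible mixed-precision split. It is not the Callipepla design: the Jacobi (diagonal) preconditioner, the choice to round only the stored matrix values to FP32 while keeping vectors and arithmetic in FP64, the test matrix, and all function names (`to_fp32`, `spmv`, `jacobi_pcg`) are assumptions made purely for illustration.

```python
import struct

def to_fp32(x):
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def spmv(rows, x):
    """Sparse matrix-vector product; rows[i] is a list of (col, value) pairs."""
    return [sum(v * x[j] for j, v in row) for row in rows]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def jacobi_pcg(rows, b, tol=1e-20, max_iter=1000):
    """Jacobi-preconditioned CG (assumed preconditioner); returns (x, iterations).

    tol bounds the squared norm of the recurrence residual.
    """
    n = len(b)
    diag = [next(v for j, v in rows[i] if j == i) for i in range(n)]
    x = [0.0] * n
    r = list(b)                                # r = b - A*x with x = 0
    z = [ri / di for ri, di in zip(r, diag)]   # z = M^-1 r
    p = list(z)
    rz = dot(r, z)
    for k in range(max_iter):
        if dot(r, r) <= tol:                   # stop as soon as converged
            return x, k
        ap = spmv(rows, p)
        alpha = rz / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        z = [ri / di for ri, di in zip(r, diag)]
        rz_new = dot(r, z)
        beta = rz_new / rz
        p = [zi + beta * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter

# Hypothetical test problem: a small diagonally dominant tridiagonal matrix.
n = 8
rows64 = [[(j, 2.1 if j == i else -1.0) for j in (i - 1, i, i + 1) if 0 <= j < n]
          for i in range(n)]
# Assumed mixed-precision split: matrix values stored in FP32, vectors in FP64.
rows32 = [[(j, to_fp32(v)) for j, v in row] for row in rows64]
b = [1.0] * n

x64, it64 = jacobi_pcg(rows64, b)
x32, it32 = jacobi_pcg(rows32, b)
print(it64, it32, max(abs(u - v) for u, v in zip(x64, x32)))
```

In this sketch the matrix values, which account for most of the data read in each sparse matrix-vector product, take half the storage of FP64, while the iteration itself still runs in FP64. The rounded matrix defines a slightly perturbed system, so the two solutions agree only up to roughly FP32 rounding of the matrix entries scaled by the problem's conditioning.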