High-Level FPGA Accelerator Design for Structured-Mesh-Based Explicit Numerical Solvers

From arXiv. Preprint, accepted to the 35th IEEE International Parallel and Distributed Processing Symposium (IPDPS 2021), May 2021, Portland, Oregon, USA.

This paper presents a workflow for synthesizing near-optimal FPGA implementations of structured-mesh-based stencil applications for explicit solvers. It leverages key characteristics of the application class, its computation-communication pattern, and the architectural capabilities of the FPGA to accelerate solvers from the high-performance computing domain. Key new features of the workflow are (1) the unification of standard state-of-the-art techniques with a number of high-gain optimizations such as batching and spatial blocking/tiling, motivated by increasing throughput for real-world workloads, and (2) the development and use of a predictive analytic model for exploring the design space, resource estimates, and performance. Three representative applications are implemented using the design workflow on a Xilinx Alveo U280 FPGA, demonstrating near-optimal performance and over 85% predictive model accuracy. These are compared with equivalent highly optimized implementations of the same applications on modern HPC-grade GPUs (Nvidia V100), analyzing time to solution, bandwidth, and energy consumption. Performance results indicate that, for the largest non-trivial application synthesized on the FPGA, the FPGA implementation matches the runtime of the best-performing V100 GPU solution while saving over 2x in energy. Our investigation shows the considerable challenges in gaining high performance on current-generation FPGAs compared to traditional architectures. We discuss determinants for a given stencil code to be amenable to FPGA implementation, providing insights into the feasibility and profitability of a design and its resulting performance.
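To make the terms "structured-mesh stencil" and "spatial blocking/tiling" concrete, here is a minimal sketch in plain Python. It is not the paper's FPGA workflow or code: the 5-point averaging stencil, the function names (`step_naive`, `step_tiled`, `make_mesh`, `run`), and the tile size are all assumptions chosen for illustration. It only shows that an explicit sweep reads the previous iterate and writes a new one, so the interior can be traversed tile by tile without changing the result.

```python
# Illustrative sketch only: a 5-point explicit stencil on a 2D structured mesh,
# swept once in plain row-major order and once tile by tile (spatial blocking).
# All names and the stencil itself are hypothetical, not taken from the paper.

def step_naive(u):
    """One explicit sweep: each interior cell becomes the mean of its 4 neighbours."""
    ny, nx = len(u), len(u[0])
    out = [row[:] for row in u]  # boundary cells are carried over unchanged
    for j in range(1, ny - 1):
        for i in range(1, nx - 1):
            out[j][i] = 0.25 * (u[j - 1][i] + u[j + 1][i] + u[j][i - 1] + u[j][i + 1])
    return out


def step_tiled(u, tile=4):
    """Same sweep, but the interior is visited in tile x tile blocks.

    Each block only reads its own cells plus a one-cell halo from the old
    mesh `u`, so blocks are independent of each other within a sweep.
    """
    ny, nx = len(u), len(u[0])
    out = [row[:] for row in u]
    for j0 in range(1, ny - 1, tile):
        for i0 in range(1, nx - 1, tile):
            for j in range(j0, min(j0 + tile, ny - 1)):
                for i in range(i0, min(i0 + tile, nx - 1)):
                    out[j][i] = 0.25 * (u[j - 1][i] + u[j + 1][i] + u[j][i - 1] + u[j][i + 1])
    return out


def make_mesh(ny, nx):
    """Deterministic test mesh with non-trivial values."""
    return [[float((3 * j + 5 * i) % 7) for i in range(nx)] for j in range(ny)]


def run(step, u, iters):
    """Apply `step` repeatedly, as an explicit solver would over time steps."""
    for _ in range(iters):
        u = step(u)
    return u


if __name__ == "__main__":
    a = run(step_naive, make_mesh(10, 13), 5)
    b = run(lambda u: step_tiled(u, tile=4), make_mesh(10, 13), 5)
    print("tiled sweep matches naive sweep:", a == b)
```

Because every cell is computed with the same expression from the same old values, the tiled and naive sweeps agree exactly. On real hardware the tile size would be chosen to fit the available on-chip memory, a trade-off this sketch does not model.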
