Numerical simulations can help solve complex problems. Most of the underlying algorithms are massively parallel and are therefore good candidates for FPGA acceleration, which exploits spatial parallelism. Modern FPGA devices can leverage high-bandwidth memory (HBM) technologies, but when applications are memory-bound, designers must craft advanced communication and memory architectures for efficient data movement and on-chip storage. This development process requires hardware design skills that are uncommon among domain experts. In this paper, we propose an automated tool flow that starts from a domain-specific language (DSL) for tensor expressions and generates massively parallel accelerators on HBM-equipped FPGAs. Designers can use this flow to integrate and evaluate various compiler or hardware optimizations. We use computational fluid dynamics (CFD) as a paradigmatic example. Our flow starts from a high-level specification of tensor operations and combines an MLIR-based compiler with an in-house hardware generation flow to produce systems with parallel accelerators and a specialized memory architecture that moves data efficiently, aiming to fully exploit the available CPU-FPGA bandwidth. We simulated applications with millions of elements, achieving up to 103 GFLOPS with one compute unit and custom precision when targeting a Xilinx Alveo U280. Our FPGA implementation is up to 25x more energy-efficient than expert-crafted Intel CPU implementations.
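To make the notion of a memory-bound tensor kernel concrete, the following is a minimal sketch based on the standard roofline model, not on the tool flow described above. It estimates the arithmetic intensity of a single tensor contraction and the throughput attainable under a given memory bandwidth. The contraction shape, the function names (`contraction_kernel`, `attainable_gflops`, `is_memory_bound`), and all numeric values are hypothetical and chosen only for illustration; they are not measurements from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Kernel:
    """Cost summary of one kernel invocation (illustrative model only)."""
    flops: float        # floating-point operations performed
    bytes_moved: float  # bytes exchanged with off-chip memory

    @property
    def arithmetic_intensity(self) -> float:
        # FLOPs executed per byte of off-chip traffic.
        return self.flops / self.bytes_moved


def contraction_kernel(p: int, bytes_per_value: int) -> Kernel:
    """Model out[i,j,k] = sum_l A[i,l] * u[l,j,k] with all extents equal to p.

    Assumption: every operand crosses the off-chip interface exactly once
    (ideal on-chip reuse), and each output needs p multiplies and p adds.
    """
    flops = 2 * p**4
    values = p**2 + 2 * p**3  # read A, read u, write out
    return Kernel(flops=flops, bytes_moved=values * bytes_per_value)


def attainable_gflops(kernel: Kernel, peak_gflops: float, bandwidth_gbs: float) -> float:
    # Roofline bound: compute peak or bandwidth-limited rate, whichever is lower.
    return min(peak_gflops, kernel.arithmetic_intensity * bandwidth_gbs)


def is_memory_bound(kernel: Kernel, peak_gflops: float, bandwidth_gbs: float) -> bool:
    return kernel.arithmetic_intensity * bandwidth_gbs < peak_gflops


if __name__ == "__main__":
    # Hypothetical platform numbers, not taken from any datasheet.
    PEAK_GFLOPS, BANDWIDTH_GBS = 100.0, 10.0
    for width in (8, 4):  # e.g. 64-bit values vs. a narrower custom format
        k = contraction_kernel(p=11, bytes_per_value=width)
        print(width, round(k.arithmetic_intensity, 3),
              round(attainable_gflops(k, PEAK_GFLOPS, BANDWIDTH_GBS), 2),
              is_memory_bound(k, PEAK_GFLOPS, BANDWIDTH_GBS))
```

Under this model, halving the bytes per value doubles the arithmetic intensity, which is one reason a narrower custom precision can raise throughput for a bandwidth-limited kernel.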