Computing and Compressing Electron Repulsion Integrals on FPGAs

The computation of electron repulsion integrals (ERIs) over Gaussian-type orbitals (GTOs) is a challenging problem in quantum-mechanics-based atomistic simulations. In practical simulations, several trillion ERIs may have to be computed for every time step. In this work, we investigate FPGAs as accelerators for the ERI computation. We use template parameters, here within the Intel oneAPI tool flow, to create customized designs for 256 different ERI quartet classes, based on their orbitals. To maximize data reuse, all intermediates are buffered in FPGA on-chip memory with a customized layout. The pre-calculation of intermediates also helps to overcome data dependencies caused by multi-dimensional recurrence relations. The loop structures involved are partially or even fully unrolled to achieve high throughput in the FPGA kernels. Furthermore, a lossy compression algorithm using arbitrary-bitwidth integers is integrated into the FPGA kernels. To the best of our knowledge, this is the first work on ERI computation on FPGAs that supports more than just the single most basic quartet class. The integration of ERI computation and compression is also a novelty that is not yet covered even by CPU or GPU libraries. Our evaluation shows that, using 16-bit integers for the ERI compression, the fastest FPGA kernels exceed a performance of 10 GERIS ($10 \times 10^9$ ERIs per second) on one Intel Stratix 10 GX 2800 FPGA, with maximum absolute errors between $10^{-7}$ and $10^{-5}$ Hartree. The measured throughput can be accurately explained by a performance model. The FPGA kernels deployed on 2 FPGAs outperform similar computations using the widely used libint reference on a two-socket server with 40 Xeon Gold 6148 CPU cores of the same process technology by factors of up to 6.0x, and on a new two-socket server with 128 EPYC 7713 CPU cores by up to 1.9x.
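The abstract does not spell out the compression scheme, so the following is only a minimal sketch of one common way to do lossy compression with integers of a chosen bit width: uniform quantization with a single scale factor per block of values. The function names `compress` and `decompress`, the per-block scale, and the sample values are illustrative assumptions, not the authors' kernel code.

```python
def compress(values, bits=16):
    """Quantize floats to signed integer codes of the given bit width.

    Assumption: one scale factor per block, chosen so that the largest
    magnitude in the block maps to the largest representable code.
    """
    qmax = (1 << (bits - 1)) - 1          # e.g. 32767 for 16 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [round(v / scale) for v in values]
    return codes, scale


def decompress(codes, scale):
    """Map integer codes back to approximate floating-point values."""
    return [c * scale for c in codes]


# Hypothetical block of integral values (not data from the paper).
block = [0.731, -0.0125, 3.2e-4, -0.25, 0.0]

codes, scale = compress(block, bits=16)
restored = decompress(codes, scale)

# Rounding to the nearest code bounds the absolute error by half a step.
max_err = max(abs(a - b) for a, b in zip(block, restored))
print(f"scale = {scale:.3e}, max abs error = {max_err:.3e}")
```

In this sketch the absolute error of each value is bounded by half the quantization step, so a narrower bit width or a larger value range in the block increases the error. The storage argument is simple arithmetic: at 10 GERIS, 16-bit codes amount to about 20 GB/s of output, compared with about 80 GB/s if every ERI were written as a 64-bit double.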



FPGA: ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. Publisher: ACM. Site: http://dblp.uni-trier.de/db/conf/fpga/
