The computation of electron repulsion integrals (ERIs) over Gaussian-type orbitals (GTOs) is a challenging problem in quantum-mechanics-based atomistic simulations. In practical simulations, several trillion ERIs may have to be computed for every time step. In this work, we investigate FPGAs as accelerators for the ERI computation. We use template parameters, here within the Intel oneAPI tool flow, to create customized designs for 256 different ERI quartet classes, based on their orbitals. To maximize data reuse, all intermediates are buffered in FPGA on-chip memory with a customized layout. The pre-calculation of intermediates also helps to overcome data dependencies caused by multi-dimensional recurrence relations. The loop structures involved are partially or even fully unrolled to achieve high throughput of the FPGA kernels. Furthermore, a lossy compression algorithm utilizing arbitrary-bitwidth integers is integrated into the FPGA kernels. To the best of our knowledge, this is the first work on ERI computation on FPGAs that supports more than just the single most basic quartet class. In addition, the integration of ERI computation and compression is a novelty that is not even covered by CPU or GPU libraries so far. Our evaluation shows that, using 16-bit integers for the ERI compression, the fastest FPGA kernels exceed a performance of 10 GERIS ($10 \times 10^9$ ERIs per second) on one Intel Stratix 10 GX 2800 FPGA, with maximum absolute errors of around $10^{-7}$ to $10^{-5}$ Hartree. The measured throughput can be accurately explained by a performance model. Deployed on 2 FPGAs, the kernels outperform similar computations using the widely used libint reference on a two-socket server with 40 Xeon Gold 6148 CPU cores of the same process technology by factors of up to 6.0x, and on a newer two-socket server with 128 EPYC 7713 CPU cores by up to 1.9x.
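The abstract does not spell out the compression scheme, so the following is only a minimal sketch of one common way to do lossy compression into fixed-width integers: uniform quantization of a block of values against a shared step size. The function names `compress_block` and `decompress_block`, the per-block scaling, and the choice of signed codes are all assumptions made for illustration, not the method used in this work. In this sketch the absolute reconstruction error of each value is bounded by half the step size.

```python
# Hypothetical sketch: uniform quantization to signed integers of a chosen
# bit width. This is NOT taken from the paper; it only illustrates the idea
# of trading precision for a narrower integer representation.

def compress_block(values, bits=16):
    """Quantize a list of floats to signed `bits`-wide integer codes.

    Returns (step, codes), where step is the shared quantization step.
    """
    if not values:
        return 0.0, []
    qmax = (1 << (bits - 1)) - 1          # largest code, e.g. 32767 for 16 bits
    peak = max(abs(v) for v in values)    # largest magnitude in the block
    if peak == 0.0:
        return 0.0, [0] * len(values)
    step = peak / qmax                    # one code unit in value space
    codes = [round(v / step) for v in values]
    return step, codes


def decompress_block(step, codes):
    """Reconstruct approximate floats from the integer codes."""
    return [step * q for q in codes]


if __name__ == "__main__":
    block = [0.5, -0.25, 1e-3, 0.0, -0.5]   # made-up sample values
    step, codes = compress_block(block, bits=16)
    restored = decompress_block(step, codes)
    worst = max(abs(a - b) for a, b in zip(block, restored))
    print("step:", step, "worst abs error:", worst)
```

Because the step scales with the largest magnitude in a block, narrower bit widths or blocks with a wide dynamic range give larger errors under this scheme; a real kernel would have to choose the bit width and the scaling granularity against its accuracy target.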