Least Squares on GPUs in Multiple Double Precision

This paper describes how code generated by the CAMPARY software is applied to accelerate the solution of linear systems in the least squares sense on Graphics Processing Units (GPUs), in double double, quad double, and octo double precision. The goal is to use accelerators to offset the cost overhead of multiple double precision arithmetic. For the blocked Householder QR and the back substitution, the dimensions of interest are those at which teraflop performance is attained. A second question is the cost overhead factor incurred each time the precision is doubled. Experimental results are reported on five different NVIDIA GPUs, with a particular focus on the P100 and the V100, both capable of teraflop performance. Thanks to the high Compute to Global Memory Access (CGMA) ratios of multiple double arithmetic, teraflop performance is already attained when running the double double QR on 1,024-by-1,024 matrices, on both the P100 and the V100. For the back substitution, the dimension of the upper triangular system must be as high as 17,920 to reach one teraflop on the V100 in quad double precision, and then only when counting the time spent by the kernels alone. The lower performance of the back substitution in small dimensions does not prevent teraflop performance of the solver at dimension 1,024, because the time for the QR decomposition dominates. In doubling the precision from double double to quad double and from quad double to octo double, the observed cost overhead factors are lower than the factors predicted by the arithmetical operation counts. This observation correlates with the increased performance for increased precision, which can again be explained by the high CGMA ratios.
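
To make the CGMA argument concrete, the following is a minimal Python sketch of double double arithmetic, in which a number is stored as an unevaluated sum of two doubles `(hi, lo)`. It is not the code generated by CAMPARY. All function names (`two_sum`, `two_prod`, `dd_add`, `dd_mul`, `dd_sum`) are illustrative assumptions, and the addition shown is a simplified variant built on the classical error-free transformations of Knuth and Dekker. The point it illustrates is that each double double operation loads only four doubles yet performs many double operations on them, which is what raises the ratio of computation to memory access.

```python
from fractions import Fraction


def two_sum(a, b):
    """Error-free addition (Knuth): s = fl(a + b) and s + e == a + b exactly."""
    s = a + b
    bb = s - a
    e = (a - (s - bb)) + (b - bb)
    return s, e


def quick_two_sum(a, b):
    """Error-free addition that assumes |a| >= |b| (or a == 0)."""
    s = a + b
    e = b - (s - a)
    return s, e


def split(a):
    """Veltkamp split of a double into two halves with a == hi + lo."""
    t = 134217729.0 * a  # 2**27 + 1
    hi = t - (t - a)
    lo = a - hi
    return hi, lo


def two_prod(a, b):
    """Error-free product (Dekker): p = fl(a * b) and p + e == a * b exactly,
    barring overflow and underflow."""
    p = a * b
    ah, al = split(a)
    bh, bl = split(b)
    e = ((ah * bh - p) + ah * bl + al * bh) + al * bl
    return p, e


def dd_add(x, y):
    """Simplified double double addition of x = (hi, lo) and y = (hi, lo)."""
    s, e = two_sum(x[0], y[0])
    e += x[1] + y[1]
    return quick_two_sum(s, e)


def dd_mul(x, y):
    """Double double multiplication of x = (hi, lo) and y = (hi, lo)."""
    p, e = two_prod(x[0], y[0])
    e += x[0] * y[1] + x[1] * y[0]
    return quick_two_sum(p, e)


def dd_sum(values):
    """Accumulate plain doubles in a double double accumulator."""
    acc = (0.0, 0.0)
    for v in values:
        acc = dd_add(acc, (v, 0.0))
    return acc


if __name__ == "__main__":
    values = [0.1] * 10
    exact = sum(Fraction(v) for v in values)  # exact rational reference
    hi, lo = dd_sum(values)
    print("double error       :", float(abs(Fraction(sum(values)) - exact)))
    print("double double error:", float(abs(Fraction(hi) + Fraction(lo) - exact)))
```

Run as a script, this compares the error of a plain double sum against the double double sum of the same ten values, using exact rational arithmetic as the reference. Quad double and octo double follow the same pattern with four and eight doubles per number, and correspondingly more arithmetic per number loaded.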
