This work investigates a variant of the conjugate gradient (CG) method and embeds it into the context of high-order finite-element schemes with fast matrix-free operator evaluation and cheap preconditioners like the matrix diagonal. Relying on a data-dependency analysis and an appropriate enumeration of degrees of freedom, we interleave the vector updates and inner products of a CG iteration with the matrix-vector product, incurring only minor organizational overhead. As a result, around 90% of the vector entries of the three active vectors of the CG method are transferred from slow RAM memory exactly once per iteration, with all additional accesses hitting fast cache memory. Node-level performance analyses and scaling studies on up to 147k cores show that, for problem sizes exceeding the processor caches, the CG method with the proposed performance optimizations is around two times faster than a standard CG solver as well as optimized pipelined CG and s-step CG methods, and it provides similar performance near the strong scaling limit.
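To make the fusion idea concrete, the following minimal C++ sketch merges the CG vector updates and inner products into blocked sweeps over a simple matrix-free 1D Laplacian. The block size, the operator, and the three-sweep structure are illustrative assumptions; in particular, the sketch omits the dependency-aware interleaving of the update p = r + beta*p with the next matrix-vector product that the abstract describes, so it should be read as a schematic of the data-locality principle rather than as the proposed method itself.

```cpp
// Minimal sketch: blocked CG sweeps that fuse vector updates with the inner
// products, so vector entries touched by a block stay in cache for all
// operations on that block. Hypothetical matrix-free 1D Laplacian operator;
// sizes and structure are illustrative assumptions.
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

// Matrix-free application of a 1D Laplacian (Dirichlet BC) on one block.
void apply_block(const std::vector<double> &p, std::vector<double> &v,
                 std::size_t begin, std::size_t end)
{
  const std::size_t n = p.size();
  for (std::size_t i = begin; i < end; ++i)
    {
      const double left  = (i == 0) ? 0.0 : p[i - 1];
      const double right = (i + 1 == n) ? 0.0 : p[i + 1];
      v[i] = 2.0 * p[i] - left - right;
    }
}

int main()
{
  const std::size_t n = 1 << 12, block = 256;
  std::vector<double> x(n, 0.0), r(n, 1.0), p(r), v(n, 0.0);

  double gamma = 0.0; // gamma = (r, r)
  for (std::size_t i = 0; i < n; ++i)
    gamma += r[i] * r[i];

  for (std::size_t it = 0; it < n && std::sqrt(gamma) > 1e-10; ++it)
    {
      // Sweep 1: matrix-vector product fused with the inner product (p, v).
      double pv = 0.0;
      for (std::size_t b = 0; b < n; b += block)
        {
          const std::size_t e = std::min(b + block, n);
          apply_block(p, v, b, e);
          for (std::size_t i = b; i < e; ++i)
            pv += p[i] * v[i]; // entries of p, v still reside in cache
        }
      const double alpha = gamma / pv;

      // Sweep 2: updates of x and r fused with the inner product (r, r),
      // so x, r, p, v are each streamed through memory once in this sweep.
      double gamma_new = 0.0;
      for (std::size_t b = 0; b < n; b += block)
        {
          const std::size_t e = std::min(b + block, n);
          for (std::size_t i = b; i < e; ++i)
            {
              x[i] += alpha * p[i];
              r[i] -= alpha * v[i];
              gamma_new += r[i] * r[i];
            }
        }
      const double beta = gamma_new / gamma;
      gamma             = gamma_new;

      // Sweep 3: p = r + beta * p. In the interleaved variant sketched in the
      // abstract, this update is instead folded into the data accesses of the
      // next matrix-vector product rather than forming a separate sweep.
      for (std::size_t i = 0; i < n; ++i)
        p[i] = r[i] + beta * p[i];
    }
  std::printf("residual norm: %.3e\n", std::sqrt(gamma));
  return 0;
}
```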