Incomplete LU (ILU) smoothers are effective in the algebraic multigrid (AMG) $V$-cycle for reducing high-frequency components of the error. However, the requisite direct triangular solves are comparatively slow on GPUs. Previous work by Anzt et al. (2015) demonstrated the advantages of Jacobi iteration as an alternative to the direct solution of these systems. Depending on the threshold and fill-level parameters chosen, the factors can be highly non-normal, in which case Jacobi is unlikely to converge in a small number of iterations. We demonstrate that row scaling can reduce the departure from normality, allowing us to replace the inherently sequential solve with a rapidly converging Richardson iteration. There are several advantages beyond the lower compute time. Because scaling is applied directly to the factor, it is performed locally on a diagonal block of the global matrix. Further, an ILUT Schur complement smoother maintains a constant GMRES iteration count as the number of MPI ranks increases, and thus parallel strong scaling is improved. Our algorithms have been incorporated into hypre, and we demonstrate improved time to solution for the Nalu-Wind and PeleLM pressure solvers. For large problem sizes, GMRES$+$AMG executes at least five times faster when using iterative triangular solves compared with direct solves on massively parallel GPUs.
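The idea of replacing a sequential triangular solve with a fixed-point iteration can be illustrated with a small NumPy sketch. It is not the hypre implementation: the function names (`henrici_departure`, `row_scale`, `richardson_lower`) are hypothetical, the test matrix is random, and the row scaling is assumed here to be division of each row by its diagonal entry, which may differ from the scaling used in the paper. With that particular choice the scaled factor has a unit diagonal, so the Richardson update coincides with Jacobi and terminates in at most $n$ sweeps because the iteration matrix is nilpotent.

```python
import numpy as np

def henrici_departure(T):
    """Frobenius-norm departure from normality of a triangular matrix.

    The eigenvalues of a triangular matrix are its diagonal entries, so the
    Henrici measure reduces to the norm of the strictly triangular part.
    """
    return np.linalg.norm(T - np.diag(np.diag(T)), "fro")

def row_scale(L, b):
    """Divide each row of L (and the matching entry of b) by the diagonal.

    Assumption for this sketch: diagonal row scaling. The solution of the
    scaled system is the same as that of the original system.
    """
    d = np.diag(L)
    return L / d[:, None], b / d

def richardson_lower(L, b, num_iters):
    """Richardson iteration x <- x + (b - L x), starting from x = 0.

    Each sweep is one matrix-vector product, which parallelizes well,
    unlike the row-by-row dependency of forward substitution.
    """
    x = np.zeros_like(b)
    for _ in range(num_iters):
        x = x + (b - L @ x)
    return x

# Small random lower-triangular test factor with diagonal entries in [2, 4].
rng = np.random.default_rng(0)
n = 8
L = np.tril(rng.uniform(-1.0, 1.0, (n, n)), k=-1) + np.diag(rng.uniform(2.0, 4.0, n))
b = rng.standard_normal(n)

Ls, bs = row_scale(L, b)
x_iter = richardson_lower(Ls, bs, n)   # n sweeps: exact up to roundoff
x_direct = np.linalg.solve(L, b)       # reference direct solve

print("departure before scaling:", henrici_departure(L))
print("departure after scaling: ", henrici_departure(Ls))
print("max difference vs direct solve:", np.max(np.abs(x_iter - x_direct)))
```

In practice only a few sweeps would be applied, far fewer than $n$, so the factor's departure from normality governs how quickly the early iterates approach the direct solution.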