We present an implementation of a fully stage-parallel preconditioner for Radau IIA type fully implicit Runge--Kutta methods. The preconditioner approximates the inverse of $A_Q$ from the Butcher tableau by the lower triangular matrix resulting from an LU decomposition and diagonalizes the system with as many blocks as stages. For the transformed system, we employ a block preconditioner in which each block is distributed to and solved by a subgroup of processes in parallel. To combine partial results, we use either a communication pattern resembling Cannon's algorithm or shared memory. A performance model and a large set of performance studies (including strong-scaling runs with up to 150k processes on 3k compute nodes), conducted for a time-dependent heat problem using matrix-free finite element methods, indicate that the stage-parallel implementation can reach higher throughputs when the block solvers operate at lower parallel efficiencies, which occurs near the scaling limit. Achievable speedups increase linearly with the number of stages and are bounded by it. Furthermore, we show that the presented stage-parallel concepts are also applicable when $A_Q$ is diagonalized directly, which requires complex arithmetic or the solution of two-by-two blocks and sequentializes parts of the algorithm. As an alternative to distributing stages and assigning them to distinct processes, we discuss the possibility of batching operations from different stages together.
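The algebra behind the preconditioner can be illustrated on the smallest case. The following is a minimal sketch, not the implementation described above: it takes the standard 2-stage Radau IIA matrix $A_Q$, factors $A_Q^{-1} = LU$, and diagonalizes the lower triangular factor $L$ so that one decoupled block per stage remains. The abstract does not state which LU convention is used; the sketch assumes a Crout-style factorization (unit diagonal on $U$), because a unit-diagonal $L$ other than the identity has the single eigenvalue 1 and cannot be diagonalized. All helper names (`inverse`, `crout_lu`, `eig_lower_triangular`) are hypothetical, and exact rational arithmetic is used only to keep the checks exact.

```python
from fractions import Fraction as F


def matmul(a, b):
    """Plain dense matrix product for small list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def inverse(a):
    """Gauss-Jordan inverse with a nonzero-pivot search (exact with Fractions)."""
    n = len(a)
    m = [list(row) + [F(int(i == j)) for j in range(n)]
         for i, row in enumerate(a)]
    for c in range(n):
        p = next(r for r in range(c, n) if m[r][c] != 0)
        m[c], m[p] = m[p], m[c]
        piv = m[c][c]
        m[c] = [x / piv for x in m[c]]
        for r in range(n):
            if r != c:
                f = m[r][c]
                m[r] = [x - f * y for x, y in zip(m[r], m[c])]
    return [row[n:] for row in m]


def crout_lu(a):
    """Crout LU without pivoting: a = L U, with unit diagonal on U (assumption)."""
    n = len(a)
    L = [[F(0)] * n for _ in range(n)]
    U = [[F(int(i == j)) for j in range(n)] for i in range(n)]
    for j in range(n):
        for i in range(j, n):
            L[i][j] = a[i][j] - sum(L[i][k] * U[k][j] for k in range(j))
        for i in range(j + 1, n):
            U[j][i] = (a[j][i]
                       - sum(L[j][k] * U[k][i] for k in range(j))) / L[j][j]
    return L, U


def eig_lower_triangular(L):
    """Eigenpairs of a lower triangular matrix with distinct diagonal entries.

    The eigenvalues are the diagonal entries; each eigenvector is obtained
    by forward substitution below its own diagonal position.
    """
    n = len(L)
    lam = [L[j][j] for j in range(n)]
    S = [[F(0)] * n for _ in range(n)]
    for j in range(n):
        S[j][j] = F(1)
        for i in range(j + 1, n):
            s = sum(L[i][k] * S[k][j] for k in range(j, i))
            S[i][j] = -s / (L[i][i] - lam[j])
    return lam, S


# Standard 2-stage Radau IIA Butcher matrix.
A_Q = [[F(5, 12), F(-1, 12)],
       [F(3, 4), F(1, 4)]]

A_inv = inverse(A_Q)
L, U = crout_lu(A_inv)
lam, S = eig_lower_triangular(L)
Lam = [[lam[i] if i == j else F(0) for j in range(len(lam))]
       for i in range(len(lam))]

# L S = S Lam, i.e. L = S Lam S^{-1}: real eigenvalues, one block per stage.
assert matmul(L, S) == matmul(S, Lam)
```

Because $L$ is triangular, its eigenvalues are real and can be read off the diagonal, so the transformed system splits into one real-valued block per stage; in a stage-parallel setting each such block could then be handed to its own subgroup of processes.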