In this paper, we present algorithms and implementations for the end-to-end GPU acceleration of matrix-free low-order-refined preconditioning of high-order finite element problems. The methods described here allow for the construction of effective preconditioners for high-order problems with optimal memory usage and computational complexity. The preconditioners are based on the construction of a spectrally equivalent low-order discretization on a refined mesh, which is then amenable to, for example, algebraic multigrid preconditioning. The constants of equivalence are independent of mesh size and polynomial degree. For vector finite element problems in $H({\rm curl})$ and $H({\rm div})$ (e.g., for electromagnetic or radiation diffusion problems), a specially constructed interpolation-histopolation basis is used to ensure fast convergence. Detailed performance studies are carried out to analyze the efficiency of the GPU algorithms. The kernel throughput of each of the main algorithmic components is measured, and the strong and weak parallel scalability of the methods is demonstrated. The differing relative weights and significance of the algorithmic components on GPUs and CPUs are discussed. Results on problems involving adaptively refined nonconforming meshes are shown, and the use of the preconditioners on a large-scale magnetic diffusion problem using all spaces of the finite element de Rham complex is illustrated.
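To make the notion of spectral equivalence concrete, the property can be sketched as follows. The symbols here are illustrative assumptions and do not come from the text above: $A_p$ denotes the high-order operator, $A_h$ the low-order operator on the refined mesh (expressed in the same degrees of freedom), and $c$, $C$ the constants of equivalence.

$$
c \, x^T A_h x \;\le\; x^T A_p x \;\le\; C \, x^T A_h x \quad \text{for all } x,
\qquad\text{so that}\qquad
\kappa\!\left(A_h^{-1} A_p\right) \le \frac{C}{c}.
$$

Because $c$ and $C$ do not depend on the mesh size or the polynomial degree, the condition number of the preconditioned system stays bounded as either is varied. In practice $A_h^{-1}$ is not applied exactly; it is replaced by an approximation such as an algebraic multigrid cycle, and the bound then also depends on the quality of that approximation.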