Efficiently solving large-scale linear systems is a critical challenge in electromagnetic simulations, particularly with the Crank-Nicolson Finite-Difference Time-Domain (CN-FDTD) method. Iterative solvers are commonly used for the resulting sparse systems, but they converge slowly because the double-curl operator is ill-conditioned. Approximate preconditioners such as Successive Over-Relaxation (SOR) and Incomplete LU decomposition (ILU) do not accelerate convergence enough, while direct solvers are impractical because of their excessive memory requirements. To address this, we propose FlashMP, a novel preconditioning system built around an exact subdomain solver based on discrete transforms. FlashMP provides an efficient GPU implementation and scales across multiple GPUs through domain decomposition. Evaluations on AMD MI60 GPU clusters (up to 1000 GPUs) show that FlashMP reduces iteration counts by up to 16x and achieves speedups of 2.5x to 4.9x over baseline implementations in state-of-the-art libraries such as Hypre. Weak scalability tests show parallel efficiencies of up to 84.1%.
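To illustrate the general idea behind an exact solver based on discrete transforms, the sketch below solves a small 1D model problem. It is not FlashMP's subdomain solver: the operator here is an assumed stand-in, $(I + cT)x = b$ with $T = \mathrm{tridiag}(-1, 2, -1)$ and zero boundary values, rather than the double-curl operator of CN-FDTD. The function names (`dst1`, `solve_shifted_laplacian`, `apply_operator`) and the coefficient `c` are hypothetical. The point is only that when a discrete sine transform diagonalizes the operator, the system can be solved exactly by transforming, dividing by the eigenvalues, and transforming back.

```python
import math


def dst1(v):
    """Unnormalized type-I discrete sine transform, written as O(n^2) sums for clarity."""
    n = len(v)
    return [
        sum(v[j] * math.sin(math.pi * (j + 1) * (k + 1) / (n + 1)) for j in range(n))
        for k in range(n)
    ]


def solve_shifted_laplacian(b, c):
    """Solve (I + c*T) x = b exactly, where T = tridiag(-1, 2, -1) (assumed model operator)."""
    n = len(b)
    # Eigenvalues of T with zero (Dirichlet) boundary values; the eigenvectors are sine modes.
    lam = [2.0 - 2.0 * math.cos(math.pi * (k + 1) / (n + 1)) for k in range(n)]
    b_hat = dst1(b)                                   # move to the eigenbasis
    x_hat = [b_hat[k] / (1.0 + c * lam[k]) for k in range(n)]  # diagonal solve
    scale = 2.0 / (n + 1)                             # DST-I is its own inverse up to this factor
    return [scale * v for v in dst1(x_hat)]           # move back to the grid


def apply_operator(x, c):
    """Apply (I + c*T) to x directly, used only to check the solution."""
    n = len(x)
    out = []
    for i in range(n):
        left = x[i - 1] if i > 0 else 0.0
        right = x[i + 1] if i < n - 1 else 0.0
        out.append(x[i] + c * (2.0 * x[i] - left - right))
    return out


if __name__ == "__main__":
    b = [1.0, -2.0, 0.5, 3.0, 0.0, -1.5, 2.0]
    x = solve_shifted_laplacian(b, c=4.0)
    residual = max(abs(r - bi) for r, bi in zip(apply_operator(x, 4.0), b))
    print("max residual:", residual)
```

In practice the transforms would be computed with a fast $O(n \log n)$ algorithm rather than the explicit sums used here, which are kept only to make the sketch self-contained.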