We demonstrate a high-performance, vendor-agnostic method for massively parallel solving of ensembles of ordinary differential equations (ODEs) and stochastic differential equations (SDEs) on GPUs. The method is integrated with a widely used differential equation solver library in a high-level language (Julia's DifferentialEquations.jl) and enables GPU acceleration without requiring code changes by the user. Our approach achieves state-of-the-art performance compared to hand-optimized CUDA-C++ kernels, while performing $20$--$100\times$ faster than the vectorized-map (vmap) approach implemented in JAX and PyTorch. Performance evaluation on NVIDIA, AMD, Intel, and Apple GPUs demonstrates performance portability and vendor-agnosticism. We show composability with MPI to enable distributed multi-GPU workflows. The implemented solvers are fully featured, supporting event handling, automatic differentiation, and incorporation of datasets via the GPU's texture memory, allowing scientists to take advantage of GPU acceleration on all major current architectures without changing their model code and without loss of performance.
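As a minimal sketch of the user-facing workflow the abstract describes, the following example solves a GPU ensemble with DiffEqGPU.jl following its documented API; the Lorenz system, the random parameter perturbation, and the trajectory count are illustrative choices not taken from this text, and the choice of backend object is an assumption about the user's hardware.

```julia
using DiffEqGPU, OrdinaryDiffEq, StaticArrays, CUDA

# The model is written exactly as it would be for a CPU solve;
# returning an SVector keeps the state immutable and GPU-friendly.
function lorenz(u, p, t)
    σ, ρ, β = p
    du1 = σ * (u[2] - u[1])
    du2 = u[1] * (ρ - u[3]) - u[2]
    du3 = u[1] * u[2] - β * u[3]
    return SVector{3}(du1, du2, du3)
end

u0 = @SVector [1.0f0, 0.0f0, 0.0f0]
tspan = (0.0f0, 10.0f0)
p = @SVector [10.0f0, 28.0f0, 8.0f0 / 3.0f0]
prob = ODEProblem{false}(lorenz, u0, tspan, p)

# Each ensemble member gets randomly perturbed parameters (illustrative).
prob_func = (prob, i, repeat) -> remake(prob, p = (@SVector rand(Float32, 3)) .* p)
ensemble = EnsembleProblem(prob, prob_func = prob_func, safetycopy = false)

# Solve 10,000 trajectories in parallel on the GPU. Swapping
# CUDA.CUDABackend() for AMDGPU.ROCBackend(), oneAPI.oneAPIBackend(),
# or Metal.MetalBackend() targets the other vendors.
sol = solve(ensemble, GPUTsit5(), EnsembleGPUKernel(CUDA.CUDABackend());
            trajectories = 10_000, saveat = 0.1f0)
```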