We demonstrate a high-performance, vendor-agnostic method for massively parallel solving of ensembles of ordinary differential equations (ODEs) and stochastic differential equations (SDEs) on GPUs. The method is integrated with a widely used differential equation solver library in a high-level language (Julia's DifferentialEquations.jl) and enables GPU acceleration without requiring any code changes by the user. Our approach achieves state-of-the-art performance compared to hand-optimized CUDA-C++ kernels, while performing $20$--$100\times$ faster than the vectorized-map (\texttt{vmap}) approach implemented in JAX and PyTorch. Performance evaluations on NVIDIA, AMD, Intel, and Apple GPUs demonstrate performance portability and vendor agnosticism. We show composability with MPI to enable distributed multi-GPU workflows. The implemented solvers are fully featured, supporting event handling, automatic differentiation, and the incorporation of datasets via the GPU's texture memory, allowing scientists to take advantage of GPU acceleration on all major current architectures without changing their model code and without loss of performance.