Multi-dimensional numerical integration arises frequently in physics and other scientific fields, e.g., in modeling the effects of systematic uncertainties in physical systems and in Bayesian parameter estimation. It is often prohibitively slow on CPUs. Efficient implementation on many-core architectures is challenging because the workload across the integration space cannot be predicted a priori. We propose m-Cubes, a novel implementation of the well-known Vegas algorithm for execution on GPUs. Vegas first transforms the integration variables and then computes a Monte Carlo integral estimate using adaptive partitioning of the resulting space. m-Cubes improves performance on GPUs by maintaining a relatively uniform workload across the processors. As a result, our optimized CUDA implementation for Nvidia GPUs outperforms parallelization approaches proposed in past literature. We further demonstrate the efficiency of m-Cubes by evaluating a six-dimensional integral from a cosmology application, achieving significant speedup and greater precision than the Cuba library's CPU implementation of Vegas. We also evaluate m-Cubes on a standard integrand test suite. m-Cubes outperforms the serial implementations of the Cuba and GSL libraries by orders of magnitude while maintaining comparable accuracy. Our approach yields a speedup of at least a factor of 10 over publicly available Monte Carlo based GPU implementations. In summary, m-Cubes can solve integrals that are prohibitively expensive with standard libraries and custom implementations. A header-only implementation with a modern C++ interface makes m-Cubes portable and allows its use in complicated pipelines with easy-to-define stateful integrals. An initial implementation of m-Cubes using the Kokkos framework provides compatibility with non-Nvidia GPUs.
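To make the notions of a stateful integrand and a uniform per-cell workload concrete, the following is a minimal CPU-only C++ sketch. It is not the m-Cubes interface and it is not Vegas: it uses plain stratified sampling on a fixed grid of equal cells, without the variable transformation or adaptive partitioning. All names (`Gaussian2D`, `Estimate`, `integrate_unit_square`) are hypothetical and chosen only for illustration.

```cpp
#include <cmath>
#include <cstddef>
#include <random>

// Hypothetical stateful integrand: a normalized 2-D Gaussian centered at
// (0.5, 0.5). The width sigma is state carried by the object.
struct Gaussian2D {
    double sigma;
    double operator()(double x, double y) const {
        const double pi = 3.14159265358979323846;
        const double r2 = (x - 0.5) * (x - 0.5) + (y - 0.5) * (y - 0.5);
        return std::exp(-r2 / (2.0 * sigma * sigma)) / (2.0 * pi * sigma * sigma);
    }
};

struct Estimate {
    double value;  // integral estimate
    double error;  // one-sigma statistical error estimate
};

// Stratified Monte Carlo over [0,1]^2 (an assumption-laden sketch, not Vegas).
// The unit square is split into cells_per_axis^2 equal cells and every cell
// receives the same number of samples, so each cell costs the same work.
// Requires samples_per_cell >= 2 for the variance estimate.
template <typename F>
Estimate integrate_unit_square(const F& f, std::size_t cells_per_axis,
                               std::size_t samples_per_cell, unsigned seed) {
    std::mt19937_64 rng(seed);
    std::uniform_real_distribution<double> u(0.0, 1.0);
    const double h = 1.0 / static_cast<double>(cells_per_axis);
    const double cell_volume = h * h;
    const double n = static_cast<double>(samples_per_cell);

    double total = 0.0;
    double variance = 0.0;
    for (std::size_t i = 0; i < cells_per_axis; ++i) {
        for (std::size_t j = 0; j < cells_per_axis; ++j) {
            double sum = 0.0, sum_sq = 0.0;
            for (std::size_t s = 0; s < samples_per_cell; ++s) {
                // Uniform sample inside cell (i, j).
                const double x = (static_cast<double>(i) + u(rng)) * h;
                const double y = (static_cast<double>(j) + u(rng)) * h;
                const double fx = f(x, y);
                sum += fx;
                sum_sq += fx * fx;
            }
            const double mean = sum / n;
            // Variance of the cell mean; clamp tiny negative round-off.
            double var_mean = (sum_sq / n - mean * mean) / (n - 1.0);
            if (var_mean < 0.0) var_mean = 0.0;
            total += cell_volume * mean;
            variance += cell_volume * cell_volume * var_mean;
        }
    }
    return {total, std::sqrt(variance)};
}
```

For example, `integrate_unit_square(Gaussian2D{0.1}, 16, 64, 42u)` estimates an integral whose exact value is very close to 1. Because every cell does identical work, the two outer loops are the natural candidates for distribution across GPU threads. An adaptive scheme such as Vegas would additionally reshape the sampling grid between iterations.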