We present our experience porting optimized CUDA implementations to oneAPI. We focus on the use case of numerical integration, particularly the CUDA implementations of PAGANI and $m$-Cubes. We faced several challenges that caused performance degradation in the oneAPI ports, including differences in the number of registers used per thread, compiler optimizations, and the mapping of CUDA library calls to their oneAPI equivalents. After addressing these challenges, we tested both the PAGANI and $m$-Cubes integrators on numerous integrands with various characteristics. To evaluate the quality of the ports, we collected performance metrics of the CUDA and oneAPI implementations on an Nvidia V100 GPU. We found that the oneAPI ports often achieve performance comparable to the CUDA versions, and that they are at most 10% slower.