GPUs have climbed to the top of supercomputing systems, making life harder for many legacy scientific codes. Many approaches are currently used to port such codes, with no clear consensus on which is best. We present a comparative analysis of the two most common approaches, CUDA and OpenACC, applied to the multi-physics CFD code Alya. We focus on combustion problems, which are among the most computationally demanding CFD simulations. The most compute-intensive parts of the code were analyzed in detail. New data structures for the matrix assembly step were created to facilitate SIMD execution, which benefits vectorization on the CPU and stream processing on the GPU. As a result, the CPU code's performance improved by up to 25%. On the GPU, CUDA proved to be up to 2 times faster than OpenACC for matrix assembly. In contrast, the two approaches performed similarly in the kernels for the vector operations used in the linear solver, where there is minimal memory reuse.
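The abstract does not spell out the new assembly data structures, so the following is only a minimal sketch of the general idea, not Alya's implementation. It assumes a hypothetical batch of 1D linear elements whose local data is stored with the element index as the innermost dimension, so that consecutive SIMD lanes (or GPU threads) touch consecutive memory. The names `VECTOR_SIZE`, `ElementBatch`, `compute_element_matrices`, and `scatter_batch` are all illustrative.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical batch width: number of elements processed together.
constexpr std::size_t VECTOR_SIZE = 8;
// Assumption for this sketch: 1D linear elements with two nodes each.
constexpr std::size_t NODES_PER_ELEM = 2;

// Element-local data for a whole batch. The element index is innermost,
// so a loop over elements has unit stride in memory.
struct ElementBatch {
    // x[node][elem]: nodal coordinates
    std::array<std::array<double, VECTOR_SIZE>, NODES_PER_ELEM> x;
    // K[a][b][elem]: element matrices
    std::array<std::array<std::array<double, VECTOR_SIZE>, NODES_PER_ELEM>,
               NODES_PER_ELEM> K;
};

// Compute the 1D Laplacian element matrix (1/h) * [[1, -1], [-1, 1]]
// for every element of the batch.
void compute_element_matrices(ElementBatch& batch) {
    for (std::size_t a = 0; a < NODES_PER_ELEM; ++a) {
        for (std::size_t b = 0; b < NODES_PER_ELEM; ++b) {
            const double sign = (a == b) ? 1.0 : -1.0;
            // Innermost loop over elements: independent iterations with
            // contiguous accesses, which a compiler can vectorize.
            for (std::size_t e = 0; e < VECTOR_SIZE; ++e) {
                const double h = batch.x[1][e] - batch.x[0][e];
                batch.K[a][b][e] = sign / h;
            }
        }
    }
}

// Scatter the batch into a dense, row-major global matrix A (n_nodes x n_nodes).
// Assumption: element (first_elem + e) connects nodes (first_elem + e) and
// (first_elem + e + 1), as in a simple 1D chain mesh.
void scatter_batch(const ElementBatch& batch, std::size_t first_elem,
                   std::size_t n_nodes, std::vector<double>& A) {
    for (std::size_t e = 0; e < VECTOR_SIZE; ++e) {
        const std::size_t base = first_elem + e;
        for (std::size_t a = 0; a < NODES_PER_ELEM; ++a) {
            for (std::size_t b = 0; b < NODES_PER_ELEM; ++b) {
                A[(base + a) * n_nodes + (base + b)] += batch.K[a][b][e];
            }
        }
    }
}
```

In this layout the compute step maps naturally to one GPU thread per element, while the scatter step is kept separate because neighboring elements update shared matrix entries and would need atomics or coloring if run in parallel.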