Automatic code generation is frequently used to create implementations of algorithms specifically tuned to particular hardware and application parameters. The code generation process involves the selection of adequate code transformations, tuning parameters, and parallelization strategies. To cover the huge search space, code generation frameworks may apply time-intensive autotuning, exploit scenario-specific performance models, or treat performance as an intangible black box that must be described via machine learning. This paper addresses the selection problem by identifying the relevant performance-defining mechanisms through a performance model coupled with an analytic hardware metric estimator. This enables a quick exploration of large configuration spaces to identify highly efficient candidates with high accuracy. Our current approach targets memory-intensive GPGPU applications and focuses on the correct modeling of data transfer volumes to all levels of the memory hierarchy. We show how our method can be coupled to the pystencils stencil code generator, which is used to generate kernels for a range-four 3D 25-point stencil and a complex two-phase fluid solver based on the Lattice Boltzmann Method. For both, it delivers a ranking that can be used to select the best-performing candidate. The method is not limited to stencil kernels, but can be integrated into any code generator that can generate the required address expressions.
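As a brief illustration of the kind of code generation workflow the abstract refers to, the following minimal sketch defines a stencil update symbolically and lets pystencils generate and compile a kernel from it. This is not taken from the paper: a simple 7-point Jacobi update stands in for the range-four 3D 25-point stencil, and the default (CPU) backend is used; the actual study selects pystencils' GPU backend through the kernel-creation configuration, whose exact form varies between pystencils versions.

```python
import numpy as np
import pystencils as ps

# Two 3D double-precision fields; neighbor accesses are written as index offsets.
src, dst = ps.fields("src, dst: double[3D]")

# Simple 7-point Jacobi update as a stand-in for the paper's 25-point stencil.
update = ps.Assignment(
    dst[0, 0, 0],
    (src[1, 0, 0] + src[-1, 0, 0]
     + src[0, 1, 0] + src[0, -1, 0]
     + src[0, 0, 1] + src[0, 0, -1]) / 6,
)

# Generate a kernel AST from the symbolic assignment and compile it.
# (The paper's setup would instead request GPU code generation here.)
kernel = ps.create_kernel(update).compile()

# Apply the compiled kernel to concrete arrays; field names map to keyword arguments.
src_arr = np.random.rand(32, 32, 32)
dst_arr = np.zeros_like(src_arr)
kernel(src=src_arr, dst=dst_arr)
```

The symbolic representation of the update rule is what makes the approach described in the abstract possible: because the generator knows every field access, it can emit the address expressions from which data transfer volumes per memory hierarchy level are estimated.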