Automatic code generation is frequently used to create implementations of algorithms specifically tuned to particular hardware and application parameters. The code generation process involves the selection of adequate code transformations, tuning parameters, and parallelization strategies. We propose an alternative to time-intensive autotuning, scenario-specific performance models, or black-box machine learning for selecting the best-performing configuration. This paper identifies the relevant performance-defining mechanisms for memory-intensive GPU applications through a performance model coupled with an analytic hardware metric estimator. This enables a quick exploration of large configuration spaces to identify highly efficient code candidates with high accuracy. We examine the changes of the A100 GPU architecture compared to its predecessor, the V100, and address the challenge of modeling the data transfer volumes through the new memory hierarchy. We show how our method can be coupled to the pystencils stencil code generator, which is used to generate kernels for a range-four 3D-25pt stencil and a complex two-phase fluid solver based on the Lattice Boltzmann Method. For both, it delivers a ranking that can be used to select the best-performing candidate. The method is not limited to stencil kernels but can be integrated into any code generator that can generate the required address expressions.
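To make the benchmark concrete, the range-four 3D-25pt stencil mentioned above touches the center cell plus offsets of one to four grid points along each axis. The following is a minimal NumPy reference sketch of that access pattern, not the generated GPU kernel; the coefficients and function name are hypothetical placeholders.

```python
import numpy as np

def stencil_25pt(src, coeffs, c0):
    """Reference loop for a range-four 3D 25-point star stencil:
    center cell (weight c0) plus offsets +/-1..+/-4 along each of
    the three axes (weights coeffs[0..3]). Hypothetical coefficients;
    the paper's kernels are produced by pystencils, not this loop."""
    r = 4  # stencil range
    dst = np.zeros_like(src)
    core = (slice(r, -r),) * 3  # interior points with full neighborhoods
    dst[core] = c0 * src[core]
    for axis in range(3):
        for d in range(1, r + 1):
            for shift in (d, -d):
                sl = [slice(r, -r)] * 3
                sl[axis] = slice(r + shift, src.shape[axis] - r + shift)
                dst[core] += coeffs[d - 1] * src[tuple(sl)]
    return dst
```

Counting terms confirms the 25-point shape: one center plus 3 axes x 2 directions x 4 distances = 24 neighbors. The strided slices along each axis also illustrate why the kernel is memory-intensive: each output element reads 25 inputs with little arithmetic per byte.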