Many applications use GPUs through automation tools that attempt to exploit the performance of the GPU's parallel architecture automatically. Directive-based programming models such as OpenACC are one such approach: they enable parallel computing simply by annotating loops in the code. Such abstract models, however, often prevent programmers from applying additional low-level optimizations that take advantage of advanced architectural features of GPUs, because the generated computation is hidden from the application developer. This paper describes and implements a novel, flexible optimization technique that inserts a code emulation phase at the tail end of the compilation pipeline. Our tool emulates the generated code using symbolic analysis as a substitute for dynamic information, which allows further low-level code optimizations to be applied. Our implementation supports both CUDA and OpenACC directives as frontends of the compilation pipeline, enabling low-level GPU optimizations for OpenACC that were not previously possible. We demonstrate the capabilities of our tool by automating the use of warp-level shuffle instructions, which are difficult even for advanced GPU programmers to use. Finally, we evaluate our tool on a benchmark suite and complex application code, and provide a detailed study of the benefits of shuffle instructions across four generations of GPU architectures.
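For readers unfamiliar with warp-level shuffles, the following is a minimal host-side sketch of the data movement behind a shuffle-based sum reduction. It is not the paper's tool or its generated code: it only emulates in plain C++ what CUDA's `__shfl_down_sync` intrinsic does within a single 32-lane warp, and the names `kWarpSize`, `shfl_down`, and `warp_reduce_sum` are hypothetical helpers introduced for illustration.

```cpp
#include <array>
#include <cstddef>

// Assumption: a warp has 32 lanes, as on current NVIDIA GPUs.
constexpr std::size_t kWarpSize = 32;

// Emulates one shuffle-down step: lane i reads the value held by lane
// i + delta. Lanes whose source would fall outside the warp keep their
// own value, mirroring the behaviour of __shfl_down_sync.
std::array<int, kWarpSize> shfl_down(const std::array<int, kWarpSize>& v,
                                     std::size_t delta) {
    std::array<int, kWarpSize> out{};
    for (std::size_t lane = 0; lane < kWarpSize; ++lane) {
        const std::size_t src = lane + delta;
        out[lane] = (src < kWarpSize) ? v[src] : v[lane];
    }
    return out;
}

// Tree reduction: each step halves the shuffle distance, so after
// log2(32) = 5 steps lane 0 holds the sum of all 32 lane values.
// Other lanes end up with partial or redundant sums and are ignored.
int warp_reduce_sum(std::array<int, kWarpSize> v) {
    for (std::size_t delta = kWarpSize / 2; delta > 0; delta /= 2) {
        const std::array<int, kWarpSize> shifted = shfl_down(v, delta);
        for (std::size_t lane = 0; lane < kWarpSize; ++lane) {
            v[lane] += shifted[lane];
        }
    }
    return v[0];
}
```

On a GPU, each lane would run the loop body itself with `val += __shfl_down_sync(mask, val, delta)`, exchanging registers directly instead of going through shared memory. Applying this pattern by hand requires reasoning about which lane holds which value at every step, which is the kind of bookkeeping the abstract describes as difficult.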