Using hardware performance counters to speed up autotuning convergence on GPUs

Nowadays, GPU accelerators are commonly used to speed up general-purpose computing tasks on a variety of hardware. However, due to the diversity of GPU architectures and processed data, optimizing code for a particular type of hardware and specific data characteristics can be extremely challenging. Autotuning of performance-relevant source-code parameters allows applications to be optimized automatically and keeps their performance portable. Although autotuning typically results in code speed-up, searching the tuning space can bring unacceptable overhead if (i) the tuning space is vast and full of poorly performing implementations, or (ii) the autotuning process has to be repeated frequently because of changes in processed data or migration to different hardware. In this paper, we introduce a novel method for searching tuning spaces. The method collects hardware performance counters (also known as profiling counters) during empirical tuning and uses them to steer the search towards faster implementations. The method requires the tuning space to be sampled once, on any GPU. From these samples it builds a problem-specific model, which can then be used during autotuning on various, even previously unseen, inputs or GPUs. Using a set of five benchmarks, we experimentally demonstrate that our method can speed up autotuning when an application needs to be ported to different hardware or when it needs to process data with different characteristics. We also compare our method to the state of the art and show that it is superior in terms of the number of search steps and typically outperforms other search methods in terms of convergence time.

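The idea of a counter-guided search described in the abstract can be illustrated with a toy Python sketch. This is not the paper's algorithm: the tuning parameters (`block`, `unroll`), the counter names (`mem`, `inst`), the counter formulas, and the per-GPU cost weights are all invented for illustration, and the sketch assumes that the relation between tuning parameters and counters is the same on every GPU. Under these assumptions, a model sampled once predicts how each counter changes between configurations, and the search uses the time shares observed on the target GPU to pick the next configuration to test.

```python
from itertools import product

# Hypothetical tuning space with two made-up parameters (block, unroll).
SPACE = list(product([32, 64, 128, 256], [1, 2, 4, 8]))

def counters(cfg):
    """Invented counter formulas, assumed to be GPU-independent."""
    block, unroll = cfg
    return {"mem": 1024.0 / block + 2.0 * unroll,
            "inst": 64.0 / unroll + block / 32.0}

def profile(cfg, gpu_cost):
    """Stand-in for running and profiling a kernel on one GPU.
    gpu_cost holds the (invented) time cost per counter unit on that GPU.
    Returns the total runtime and the time attributed to each counter."""
    time_per_counter = {name: gpu_cost[name] * value
                        for name, value in counters(cfg).items()}
    return sum(time_per_counter.values()), time_per_counter

def build_model(space):
    """Sample the tuning space once and record the counters per configuration."""
    return {cfg: counters(cfg) for cfg in space}

def guided_search(space, model, gpu_cost, steps):
    """Profile a configuration, then move to the untested configuration
    that the model predicts to be fastest given the observed time shares."""
    current = space[0]
    tested = {}
    for _ in range(steps):
        runtime, observed = profile(current, gpu_cost)
        tested[current] = runtime
        candidates = [c for c in space if c not in tested]
        if not candidates:
            break

        def predicted(c):
            # Scale each observed time share by the predicted counter change.
            return sum(observed[n] * model[c][n] / model[current][n]
                       for n in observed)

        current = min(candidates, key=predicted)
    best = min(tested, key=tested.get)
    return best, tested[best], len(tested)

model = build_model(SPACE)               # sampled once, on any GPU
target_gpu = {"mem": 3.0, "inst": 1.0}   # hypothetical, previously unseen GPU
best, runtime, n_tested = guided_search(SPACE, model, target_gpu, steps=2)
print(best, runtime, n_tested)
```

Because the toy model is exact, the search reaches the fastest configuration after testing only two of the sixteen configurations. Real counters are noisy and only partly portable across GPUs, so a practical search would need more steps and a more robust scoring rule.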