Recent years have witnessed phenomenal growth in the application and capabilities of Graphics Processing Units (GPUs), owing to their high parallel computation power at relatively low cost. However, writing a computationally efficient GPU program (kernel) is challenging, and generally only a few specific kernel configurations lead to significant increases in performance. Auto-tuning is the process of automatically optimizing software for highly efficient execution on a target hardware platform. It is particularly useful for GPU programming, as a single kernel requires re-tuning after code changes, for different input data, and for different architectures. However, the discrete and non-convex nature of the search space creates a challenging optimization problem. In this work, we investigate which algorithm produces the fastest kernels as the time budget for the tuning task is varied. We conduct a survey by performing experiments on 26 different kernel spaces, from 9 different GPUs, for 16 different evolutionary black-box optimization algorithms. We then analyze these results and introduce a novel metric, based on the PageRank centrality concept, as a tool for gaining insight into the difficulty of the optimization problem. We demonstrate that our metric correlates strongly with observed tuning performance.