The prohibitive expense of automatic performance tuning at scale has largely limited the use of autotuning to libraries for shared-memory and GPU architectures. We introduce a framework for approximate autotuning that achieves a desired confidence in each algorithm configuration's performance by constructing confidence intervals to describe the performance of individual kernels (subroutines of benchmarked programs). Once a kernel's performance is deemed sufficiently predictable for a set of inputs, subsequent invocations are avoided and replaced with a predictive model of the execution time. We then leverage online execution path analysis to coordinate selective kernel execution and propagate each kernel's statistical profile. This strategy is effective in the presence of frequently recurring computation and communication kernels, which is characteristic of algorithms in numerical linear algebra. We encapsulate this framework as part of a new profiling tool, Critter, that automates kernel execution decisions and propagates statistical profiles along critical paths of execution. We evaluate the performance prediction accuracy of our selective execution methods using state-of-the-art distributed-memory implementations of Cholesky and QR factorization on Stampede2, and demonstrate speed-ups of up to 7.1x with 98% prediction accuracy.
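To make the selective-execution idea concrete, the following Python sketch illustrates one way a per-kernel confidence interval could gate execution. The `KernelProfile` class, its `rel_tol` and `min_samples` parameters, and the normal-approximation critical value are illustrative assumptions for this sketch, not Critter's actual interface.

```python
import math
import random
import time

Z_95 = 1.96  # normal-approximation critical value for a 95% confidence interval

class KernelProfile:
    """Per-kernel online statistics (hypothetical sketch, not Critter's API).

    Decides when a kernel's execution time is predictable enough that
    further invocations can be skipped and replaced by the sample mean."""

    def __init__(self, rel_tol=0.05, min_samples=3):
        self.n = 0            # timed invocations observed so far
        self.mean = 0.0       # running mean execution time (seconds)
        self.m2 = 0.0         # running sum of squared deviations (Welford)
        self.rel_tol = rel_tol        # target CI half-width as a fraction of the mean
        self.min_samples = min_samples

    def record(self, t):
        """Welford's online update keeps mean/variance without storing samples."""
        self.n += 1
        delta = t - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (t - self.mean)

    def is_predictable(self):
        """True once the 95% CI on the mean is narrower than rel_tol * mean."""
        if self.n < self.min_samples:
            return False
        stderr = math.sqrt(self.m2 / (self.n - 1)) / math.sqrt(self.n)
        return Z_95 * stderr <= self.rel_tol * self.mean

def run_kernel():
    """Stand-in for a benchmarked kernel with mildly noisy runtime."""
    time.sleep(0.001 * random.uniform(0.95, 1.05))

profile = KernelProfile()
total = 0.0
for _ in range(100):
    if profile.is_predictable():
        total += profile.mean          # skip execution; charge the predicted time
    else:
        start = time.perf_counter()
        run_kernel()
        elapsed = time.perf_counter() - start
        profile.record(elapsed)
        total += elapsed
print(f"estimated total time: {total:.4f}s after {profile.n} real invocations")
```

Welford's online update avoids storing per-invocation samples, which matters when a kernel recurs thousands of times per run; once the interval is tight enough, the loop charges the predicted mean instead of re-executing the kernel.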