Accurate hardware performance models are critical to efficient code generation. They can be used by compilers to make heuristic decisions, by superoptimizers as a minimization objective, or by autotuners to find an optimal configuration for a specific program. However, they are difficult to develop because contemporary processors are complex, and the recent proliferation of deep learning accelerators has increased the development burden. We demonstrate a method of learning performance models from a corpus of tensor computation graph programs for Tensor Processing Unit (TPU) instances. We show that our learned model outperforms a heavily optimized analytical performance model on two tasks -- tile-size selection and operator fusion -- and that it helps an autotuner discover faster programs in a setting where access to TPUs is limited or expensive.
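To make the idea concrete, the following is a minimal sketch (not the paper's implementation) of a learned performance model over a tensor computation graph: each op is a node with a feature vector, one round of message passing mixes neighbor information, and a readout predicts a scalar (log) runtime. All shapes, features, and weights here are illustrative assumptions; in practice the parameters would be fit to measured runtimes from the program corpus.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy graph: 4 ops (e.g. param -> conv -> bias-add -> relu), directed edges.
num_nodes, feat_dim, hid_dim = 4, 8, 16
node_feats = rng.normal(size=(num_nodes, feat_dim))  # per-op features, e.g. opcode/shape/tile size
adj = np.array([[0, 1, 0, 0],
                [0, 0, 1, 0],
                [0, 0, 0, 1],
                [0, 0, 0, 0]], dtype=float)          # adj[i, j] = 1 means op i feeds op j

# Hypothetical learned parameters (random here; training would fit these
# to measured runtimes with a regression loss).
W_embed = rng.normal(size=(feat_dim, hid_dim)) * 0.1
W_msg = rng.normal(size=(hid_dim, hid_dim)) * 0.1
w_out = rng.normal(size=(hid_dim,)) * 0.1

def predict_log_runtime(node_feats, adj):
    """One message-passing step followed by a mean readout to a scalar."""
    h = np.maximum(node_feats @ W_embed, 0.0)        # embed each op
    msgs = (adj + adj.T) @ h                         # aggregate neighbor embeddings
    h = np.maximum(h + msgs @ W_msg, 0.0)            # combine and apply nonlinearity
    return float(h.mean(axis=0) @ w_out)             # graph-level readout

print(predict_log_runtime(node_feats, adj))
```

An autotuner in the limited-TPU-access setting described above could use such a model to rank candidate configurations (e.g. tile sizes or fusion decisions) by predicted runtime, reserving real hardware measurements for only the most promising candidates.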