Creating high-performance implementations of deep learning primitives on CPUs is a challenging task. Multiple considerations, including the multi-level cache hierarchy and the wide SIMD units of CPU platforms, influence the choice of program transformations to apply for performance optimization. In this paper, we present machine learning-powered compiler techniques to optimize loop nests. We take a two-pronged approach to code optimization: we first apply high-level optimizations so that the code takes full advantage of the cache memories, and then perform low-level, target-specific optimizations to vectorize the code effectively for the SIMD units of the machine. For high-level optimization, we use polyhedral compilation techniques and deep learning approaches. For low-level optimization, we use a target-specific code generator that emits code using vector intrinsics, together with Reinforcement Learning (RL) techniques to find the optimal parameters for the code generator. We experimentally evaluate the developed techniques on various matrix multiplications that occur in popular deep learning workloads. The results show that the compiler techniques presented in this paper achieve 7.6X and 8.2X speed-ups over a baseline for sequential and parallel runs, respectively.
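To make the two optimization levels concrete, below is a minimal sketch (not the paper's code generator) of the kind of transformation involved: a cache-blocked matrix multiplication in C. The tile sizes MC, NC, and KC stand in for the kind of parameters a search such as the paper's RL tuner would select, and the unit-stride inner loop is the part a target-specific backend would map onto SIMD intrinsics. The function name and tile values are hypothetical.

```c
#include <stddef.h>

/* Illustrative sketch only: a cache-blocked C += A * B for row-major
 * matrices (A is MxK, B is KxN, C is MxN). The tile sizes below are
 * hypothetical placeholders for parameters that a tuner would choose. */

enum { MC = 64, NC = 64, KC = 64 };  /* assumed tile sizes, not tuned */

static inline size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

void matmul_tiled(size_t M, size_t N, size_t K,
                  const float *A, const float *B, float *C)
{
    for (size_t i0 = 0; i0 < M; i0 += MC)
        for (size_t k0 = 0; k0 < K; k0 += KC)
            for (size_t j0 = 0; j0 < N; j0 += NC)
                /* Within a tile, the working set of A, B, and C is small
                 * enough to stay resident in cache across iterations. */
                for (size_t i = i0; i < min_sz(i0 + MC, M); i++)
                    for (size_t k = k0; k < min_sz(k0 + KC, K); k++) {
                        float a = A[i * K + k];
                        /* Unit-stride inner loop over j: the natural
                         * candidate for mapping onto SIMD units. */
                        for (size_t j = j0; j < min_sz(j0 + NC, N); j++)
                            C[i * N + j] += a * B[k * N + j];
                    }
}
```

In this sketch the high-level step corresponds to choosing the loop order and tile sizes for cache locality, while the low-level step corresponds to replacing the inner loop with explicit vector intrinsics for the target machine.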