We show in this work that memory-intensive computations can result in severe performance problems due to off-chip memory access and CPU-GPU context switch overheads in a wide range of deep learning models. Current just-in-time (JIT) kernel fusion and code generation techniques have limitations for this problem, such as coarse fusion-plan exploration strategies and limited code generation capability. We propose FusionStitching, a deep learning compiler capable of fusing memory-intensive operators, with varied data dependencies and non-homogeneous parallelism, into large GPU kernels to reduce global memory access and context switch overhead automatically. By introducing data reuse of intermediate values, FusionStitching widens the range of operator combinations that fusion can target beyond previous JIT works. It explores large fusion spaces to determine optimal fusion plans, taking memory access costs, kernel calls, and resource usage constraints into account, and it tunes the optimal stitching scheme efficiently with a domain-specific cost model. Experimental results show that FusionStitching achieves up to 2.21x speedup over the state of the art, with 1.45x on average. Beyond these experiments, we integrated our approach into a compiler product and deployed it on a production cluster for AI workloads with thousands of GPUs. The system has been in operation for more than four months and saves 7,000 GPU hours on average for approximately 30,000 tasks per month.
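To make the fusion idea concrete, the following is a minimal illustrative sketch, not FusionStitching's actual generated code: it contrasts two memory-intensive elementwise kernels launched back-to-back, which round-trip an intermediate tensor through global memory, with a single fused kernel that keeps the intermediate value in a register. All kernel and variable names here (scale_kernel, fused_scale_add_kernel, etc.) are hypothetical.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale_kernel(const float* x, float* tmp, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tmp[i] = a * x[i];            // intermediate written to global memory
}

__global__ void add_kernel(const float* tmp, const float* y, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = tmp[i] + y[i];       // intermediate read back from global memory
}

// Fused version: the intermediate a*x[i] never leaves a register, saving one
// global-memory store, one load, and one kernel launch (one context switch).
__global__ void fused_scale_add_kernel(const float* x, const float* y,
                                       float* out, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float t = a * x[i];                  // intermediate stays on-chip
        out[i] = t + y[i];
    }
}

int main() {
    const int n = 1 << 20;
    float *x, *y, *tmp, *out;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    cudaMallocManaged(&tmp, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    int threads = 256, blocks = (n + threads - 1) / threads;
    // Unfused: two launches; the intermediate round-trips through global memory.
    scale_kernel<<<blocks, threads>>>(x, tmp, 3.0f, n);
    add_kernel<<<blocks, threads>>>(tmp, y, out, n);
    // Fused: one launch; no intermediate traffic to global memory.
    fused_scale_add_kernel<<<blocks, threads>>>(x, y, out, 3.0f, n);
    cudaDeviceSynchronize();
    printf("out[0] = %f\n", out[0]);         // expect 5.0 for both versions
    cudaFree(x); cudaFree(y); cudaFree(tmp); cudaFree(out);
    return 0;
}
```

This toy case covers only homogeneous elementwise fusion; the paper's contribution is handling operators with varied data dependencies and non-homogeneous parallelism, where intermediates are stitched through on-chip resources such as registers or shared memory rather than always being trivially inlinable.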