Machine learning (ML) workloads launch hundreds to thousands of short-running GPU kernels per iteration. With GPU compute throughput growing rapidly, the CPU-side latency of launching kernels is emerging as a bottleneck. CUDA Graphs promise to address this by replaying a set of kernels with a single dispatch of the graph, removing per-kernel launch costs. However, CUDA Graphs remain surprisingly difficult to deploy correctly and efficiently. We present PyGraph, a compiler framework that maximizes the coverage and benefits of CUDA Graphs for ML workloads. It introduces three novel optimizations: it applies automatic code transformations to make ML applications amenable to CUDA Graphs; it eliminates parameter-copy overheads for kernels executing within CUDA Graphs; and it selectively deploys CUDA Graphs guided by a cost-benefit analysis. Across 25 ML workloads from TorchBench, HuggingFace, and TIMM, PyGraph more than doubles the benefit of deploying CUDA Graphs compared to PyTorch2, the most popular and widely used ML compiler. PyGraph is built atop PyTorch2's compilation framework and requires no programmer intervention.