Modern deep learning workloads run on distributed hardware and are difficult to optimize: data, model, and pipeline parallelism require a developer to thoughtfully restructure their workload around optimized computation and communication kernels in libraries such as cuBLAS and NCCL. The logical separation between computation and communication misses optimization opportunities that cross the abstraction boundary, leaving performance on the table. To exploit these opportunities, this paper presents CoCoNet, which consists of a compute language to express programs with both computation and communication, a scheduling language to apply transformations on such programs, and a compiler to generate high-performance kernels. Providing both computation and communication as first-class constructs enables new optimizations, such as overlapping or fusing communication with computation. CoCoNet allowed us to optimize several data-, model-, and pipeline-parallel workloads in existing deep learning systems with very few lines of code, and we show significant improvements after integrating the novel CoCoNet-generated kernels.
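To make the abstraction barrier concrete, the sketch below shows the conventional baseline pattern the paper targets, using PyTorch's `torch.distributed` only for illustration: the gradient all-reduce (communication, backed by NCCL) and the weight update (computation) are issued as separate library calls, so a standard stack cannot fuse or overlap them. This is not CoCoNet's API; the function `baseline_step` and the launch setup are illustrative assumptions.

```python
# Minimal sketch of the conventional split between communication and computation.
# Run under a distributed launcher, e.g.: torchrun --nproc_per_node=<N> script.py
import torch
import torch.distributed as dist


def baseline_step(weights: torch.Tensor, grads: torch.Tensor, lr: float = 1e-3):
    # Communication kernel (NCCL): sum gradients across all workers.
    dist.all_reduce(grads, op=dist.ReduceOp.SUM)
    grads /= dist.get_world_size()
    # Computation kernel: apply the averaged gradients.
    # In a system like CoCoNet, this update would be expressed in the same
    # program as the all-reduce above, letting the compiler fuse or overlap
    # the two instead of launching them back to back.
    weights -= lr * grads


if __name__ == "__main__":
    dist.init_process_group(backend="nccl")
    device = torch.device("cuda", dist.get_rank() % torch.cuda.device_count())
    w = torch.randn(1024, 1024, device=device)
    g = torch.randn_like(w)
    baseline_step(w, g)
    dist.destroy_process_group()
```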