The recent trend toward increasingly large machine learning models requires both training and inference to be distributed. Given the enormous cost of training these models, it is imperative to unlock optimizations in both computation and communication to obtain the best performance. However, the current logical separation between computation and communication kernels in deep learning frameworks misses optimization opportunities across this barrier. Breaking this abstraction with a holistic view enables many optimizations that improve the performance of distributed workloads. Manually applying these optimizations requires modifying the underlying computation and communication libraries for each scenario, which is time-consuming and error-prone. Therefore, we present CoCoNeT, with a DSL to express a program containing both computation and communication. CoCoNeT contains several machine-learning-aware transformations to optimize a program and a compiler to generate high-performance kernels. Providing both computation and communication as first-class constructs allows users to work at a high level of abstraction and apply powerful optimizations, such as fusion or overlapping of computation and communication. CoCoNeT enables us to optimize data-, model-, and pipeline-parallel workloads in large language models with only a few lines of code. Experiments show that CoCoNeT significantly outperforms state-of-the-art distributed machine learning implementations.
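To illustrate the overlapping of computation and communication that the abstract describes, the sketch below shows the general pipelining pattern in plain Python. This is a hypothetical illustration, not CoCoNeT's actual DSL or API: the input is split into chunks, and while one chunk's (simulated) communication proceeds asynchronously on a worker thread, the next chunk's computation runs concurrently, hiding communication latency behind compute.

```python
# Hypothetical sketch of compute/communication overlap (NOT CoCoNeT's API).
# "compute" and "communicate" are stand-ins for a compute kernel and a
# collective such as all-reduce; real implementations would use CUDA
# streams or async collectives (e.g. NCCL) instead of a thread pool.
from concurrent.futures import ThreadPoolExecutor

def compute(chunk):
    # Stand-in for a compute kernel: scale each element.
    return [x * 2 for x in chunk]

def communicate(chunk):
    # Stand-in for a communication collective: here, the identity.
    return list(chunk)

def fused_pipeline(data, num_chunks=4):
    """Split data into chunks; while chunk i is being 'communicated'
    asynchronously, chunk i+1 is computed on the main thread."""
    if not data:
        return []
    size = (len(data) + num_chunks - 1) // num_chunks
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = None
        for chunk in chunks:
            computed = compute(chunk)             # compute current chunk
            if pending is not None:
                results.extend(pending.result())  # drain previous comm
            pending = pool.submit(communicate, computed)  # async comm
        results.extend(pending.result())          # drain final chunk
    return results

print(fused_pipeline(list(range(8))))  # → [0, 2, 4, 6, 8, 10, 12, 14]
```

The key point is the interleaving: at any moment the pipeline has at most one chunk in flight for communication while the next chunk is computed, which is the latency-hiding structure that fusing the two phases makes possible.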