Recently, large models have achieved state-of-the-art performance in various fields. Training such models requires distributed training techniques. However, finding an efficient distributed execution plan not only requires fine-grained model statistics, such as the memory and computing overhead of each operator, but is also a labor-intensive task even for an expert in distributed training. In this paper, we introduce MAP, a compiler built upon PyTorch that implements Memory-aware Automated Parallelization. To profile operator costs, existing training systems and machine learning pipelines either physically execute each operator or estimate memory usage with a scaled input tensor, which is often time-consuming and misleading. In contrast, MAP provides an easy-to-use symbolic profiler that generates memory and computing statistics for an arbitrary PyTorch model at trivial time cost, boosting productivity for ML developers. In addition, MAP seamlessly speeds up various static planning tasks on PyTorch computation graphs and requires only a few lines of modification to user code to generate a new module instance with a top-performing distributed execution plan. The source code is publicly available at https://github.com/hpcaitech/ColossalAI
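For illustration, the minimal sketch below shows the general idea behind symbolic profiling using only stock PyTorch meta tensors: shapes and dtypes are propagated through the model without allocating data or running real kernels, so memory statistics can be derived analytically. This is an assumed illustration of the concept, not MAP's actual profiler API.

```python
import torch
from torch import nn

# A minimal sketch of the symbolic-profiling idea (not MAP's actual API):
# build the model and a sample input on PyTorch's "meta" device, which tracks
# shapes and dtypes without allocating real memory or launching kernels,
# so statistics can be gathered at trivial time cost.
with torch.device("meta"):
    model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))
    x = torch.empty(32, 1024)
    y = model(x)  # shapes propagate symbolically; no real compute is spent

# Analytical memory estimate from the recorded shapes and dtypes.
param_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
print(f"parameters: {param_bytes / 2**20:.1f} MiB, output shape: {tuple(y.shape)}")
```

In the same spirit, MAP's symbolic profiler walks the computation graph to collect per-operator memory and compute statistics, which then feed the automated search for a distributed execution plan.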