Scaling DNNs has been shown to deliver dramatic quality gains across ML problems. This, however, has also led to a concomitant quadratic increase in computation cost. To tackle this cost, and because accelerator memory capacity has failed to keep pace, training these models increasingly relies on distributed training techniques. As such, an important question of interest is: how will compute and communication scale relative to one another as models scale and hardware evolves? A careful study which answers this question can better guide the design of future systems. To this end, this work provides a comprehensive multi-axial (algorithmic, empirical, hardware evolution) analysis of compute vs. communication (Comp-vs.-Comm) scaling for future Transformer models on future hardware. Using algorithmic analysis, we show that compute generally enjoys an edge over communication as models scale. However, when viewed through the lens of slower memory capacity scaling, these trends are being stressed. Next, we craft an empirical strategy to study Comp-vs.-Comm scaling for future models/hardware using existing hardware. This allows hundreds of future model/hardware scenarios to be studied at three orders of magnitude lower profiling cost. Our experiments demonstrate that communication will be a significant portion (about 40-75%) of execution time as models and hardware evolve, and that communication which is today hidden by overlapped computation will likely become exposed. Further, the generality of our strategy makes it a strong basis for performing Comp-vs.-Comm scaling analysis for any future model. Overall, this work underscores the increasingly large role communication will play as models scale.
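The algorithmic edge of compute over communication can be illustrated with back-of-the-envelope formulas (a sketch of our own, not the paper's exact model): per-layer matrix-multiply FLOPs in a Transformer grow roughly as the square of the hidden size h, while the per-layer all-reduce volume under tensor parallelism grows only linearly in h, so the FLOPs performed per communicated byte grows linearly with h. The constants below (24·b·s·h² dense FLOPs, two ring all-reduces per layer) are standard approximations, with the attention-score term omitted for simplicity.

```python
def layer_flops(b, s, h):
    """Approximate dense-matmul FLOPs for one Transformer layer.

    QKV + attention-output projections plus a 4x-expansion MLP give
    ~24*b*s*h^2 multiply-accumulate FLOPs; the b*s^2*h attention-score
    term is omitted for this sketch.
    """
    return 24 * b * s * h * h

def layer_comm_bytes(b, s, h, tp, bytes_per_elem=2):
    """Approximate per-GPU all-reduce traffic for one layer under
    tensor parallelism of degree `tp`.

    Two all-reduces per layer (after attention, after the MLP); a ring
    all-reduce moves ~2*(tp-1)/tp elements per element reduced.
    """
    return 2 * 2 * (tp - 1) / tp * b * s * h * bytes_per_elem

if __name__ == "__main__":
    # FLOPs-per-byte (arithmetic intensity w.r.t. communication) rises
    # linearly as the hidden size grows, holding batch/sequence fixed.
    for h in (4096, 8192, 16384):
        ratio = layer_flops(1, 2048, h) / layer_comm_bytes(1, 2048, h, tp=8)
        print(f"h={h:6d}  FLOPs per communicated byte = {ratio:10.1f}")
```

Doubling h doubles the compute-to-communication ratio in this model, which is the algorithmic trend the abstract refers to; the paper's point is that hardware evolution (memory capacity and network bandwidth scaling more slowly than FLOPs) stresses exactly this ratio.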