Scaling DNNs has been shown to deliver dramatic quality gains across ML problems. However, it has also led to a concomitant quadratic increase in computation cost. To tackle this cost, along with the failure of accelerator memory capacity to keep pace, training these models increasingly relies on distributed training techniques. As such, an important question of interest is: how will compute and communication relatively scale as models scale and hardware evolves? A careful study answering this question can better guide the design of future systems. To this end, this work provides a comprehensive multi-axial (algorithmic, empirical, hardware evolution) analysis of compute vs. communication ($\textbf{Comp-vs.-Comm}$) scaling for future Transformer models on future hardware. Our algorithmic analysis shows that compute generally enjoys an edge over communication as models scale. However, when viewed through the lens of slower memory capacity scaling, these trends are being stressed. Next, we craft an empirical strategy to study Comp-vs.-Comm scaling for future models/hardware using existing hardware, which allows hundreds of future model/hardware scenarios to be studied at three orders of magnitude lower profiling cost. Our experiments demonstrate that communication will constitute a significant portion (about 40-75%) of execution time as models and hardware evolve, and that communication which is today hidden by overlapped computation will likely become exposed. Further, the generality of our strategy makes it a strong basis for performing Comp-vs.-Comm scaling analysis of any future model. Overall, this work underscores the increasingly large role communication will play as models scale.