Numerous microarchitectural optimizations unlocked tremendous processing power for deep neural networks, which in turn fueled the AI revolution. With the exhaustion of such optimizations, the growth of modern AI is now gated by the performance of training systems, especially their data movement. Instead of focusing on single accelerators, we investigate the data-movement characteristics of large-scale training at full system scale. Based on our workload analysis, we design HammingMesh, a novel network topology that provides high bandwidth at low cost together with high job-scheduling flexibility. Specifically, HammingMesh can offer full bandwidth and isolation to deep learning training jobs with two dimensions of parallelism. Furthermore, it supports high global bandwidth for generic traffic. Thus, HammingMesh will power future large-scale deep learning systems with extreme bandwidth requirements.
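To make the topology described above concrete, the following is a minimal sketch of how a HammingMesh-style network could be constructed as a graph. It assumes the structure outlined in the paper: boards of a×b accelerators wired as local 2D meshes, with boards in the same row or column joined through shared row/column networks. The function name `hammingmesh`, the parameters (a, b, x, y), and the simplification of each row/column network to a single ideal switch node are illustrative assumptions, not the paper's exact construction.

```python
from itertools import product

def hammingmesh(a, b, x, y):
    """Build a simplified HammingMesh adjacency list.

    Accelerators are keyed as (bx, by, i, j): board column bx, board
    row by, and position (i, j) on the board's a-by-b 2D mesh.

    Sketch only: the paper connects board rows and columns with
    full-bandwidth networks (e.g., per-row/per-column fat trees); here
    each such network is modeled as a single ideal switch node.
    """
    adj = {}

    def link(u, v):
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    for bx, by, i, j in product(range(x), range(y), range(a), range(b)):
        node = (bx, by, i, j)
        # on-board 2D mesh links
        if i + 1 < a:
            link(node, (bx, by, i + 1, j))
        if j + 1 < b:
            link(node, (bx, by, i, j + 1))
        # accelerators on the board's east/west edges attach to their
        # accelerator-row's network, shared across the whole board row
        if i in (0, a - 1):
            link(node, ('row-switch', by, j))
        # likewise, north/south edges attach to the column network
        if j in (0, b - 1):
            link(node, ('col-switch', bx, i))
    return adj

# Example: a 2x2 arrangement of 4x4 boards (64 accelerators total)
topo = hammingmesh(4, 4, 2, 2)
accels = [n for n in topo if not isinstance(n[0], str)]
print(len(accels), "accelerators,", len(topo) - len(accels), "switch groups")
```

Under these assumptions, the sketch illustrates the abstract's central trade-off: dense, cheap mesh links carry the local traffic of a job's two parallelism dimensions, while the sparser row/column networks provide the global bandwidth for generic traffic.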