Distributed pre-training of large models at scale often imposes heavy memory demands on individual nodes and incurs significant intra-node communication costs. We propose a novel alternative approach that reduces memory requirements by training small, structured subnetworks of the model on separate workers. Unlike pipelining, our method avoids inter-node activation communication and keeps bandwidth requirements comparable to or lower than those of standard all-reduce-based data-parallel communication schemes. We evaluate two subnetwork construction strategies, guided by the principle that every parameter should be represented uniformly across the distributed training setup. Our results show that the stochastic block dropping technique consistently outperforms the width-wise subnetwork construction previously explored in federated learning. We empirically attribute this superior performance to stronger gradient alignment in subnetworks that retain blocks with skip connections. Preliminary experiments highlight the promise of our approach, achieving a 20-40% reduction in memory usage without any loss in performance.
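To make the block-dropping idea concrete, the following is a minimal, hypothetical sketch (not the paper's implementation) of how a worker might materialize a depth-wise subnetwork from a stack of residual blocks, with block-to-worker assignments chosen so that every block is represented uniformly across workers. The names (`ResidualBlock`, `BlockDroppedSubnetwork`, `assign_blocks`) and the round-robin assignment scheme are illustrative assumptions, not details taken from the paper.

```python
# Minimal, hypothetical sketch of depth-wise (block-dropping) subnetwork construction.
# Not the paper's implementation; names and the assignment scheme are illustrative.
import random

import torch
import torch.nn as nn


class ResidualBlock(nn.Module):
    """A block with a skip connection; dropping it reduces the layer to identity."""

    def __init__(self, dim: int):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, x):
        return x + self.ff(x)


class BlockDroppedSubnetwork(nn.Module):
    """Worker-side view of the model: only the assigned blocks are kept in memory;
    dropped blocks contribute nothing beyond their identity (skip) path."""

    def __init__(self, blocks, keep_indices):
        super().__init__()
        self.keep_indices = list(keep_indices)
        self.blocks = nn.ModuleList(blocks[i] for i in self.keep_indices)

    def forward(self, x):
        for block in self.blocks:
            x = block(x)
        return x


def assign_blocks(num_blocks: int, num_workers: int, seed: int = 0):
    """One simple way to honor uniform parameter representation: shuffle the block
    indices and deal them out round-robin, so each block is trained by exactly one
    worker per assignment and, across re-sampled assignments, appears equally often."""
    order = random.Random(seed).sample(range(num_blocks), num_blocks)
    return [sorted(order[w::num_workers]) for w in range(num_workers)]


if __name__ == "__main__":
    dim, num_blocks, num_workers = 64, 12, 4
    full_model_blocks = [ResidualBlock(dim) for _ in range(num_blocks)]
    assignments = assign_blocks(num_blocks, num_workers, seed=42)
    worker0 = BlockDroppedSubnetwork(full_model_blocks, assignments[0])
    print(assignments[0], worker0(torch.randn(2, dim)).shape)
```

In an actual distributed run, each worker would train only its assigned blocks locally and the updated parameters would be synchronized periodically (e.g., via all-reduce over the corresponding blocks); that communication logic is omitted from this sketch.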