Pre-training large neural networks at scale imposes heavy memory demands on accelerators and often requires costly communication. We introduce Subnetwork Data Parallelism (SDP), a distributed training framework that partitions a model into structured subnetworks trained across workers without exchanging activations. We study two complementary masking regimes: backward masking, which applies sparsity only in the backward pass to retain unbiased gradients, and forward masking, which also removes parameters from the forward pass, yielding stronger efficiency gains along with additional regularization. We further explore two subnetwork construction strategies, neuron-level and block-level, applied to both CNNs and transformers. In experiments spanning CNNs and transformers on CIFAR and ImageNet, as well as LLM pre-training on FineWeb, SDP reduces per-device memory usage by 30%-75% while maintaining or improving performance. Notably, in FLOP-matched settings, forward masking can sometimes achieve better performance.
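To make the two masking regimes concrete, the following is a minimal, hypothetical PyTorch sketch rather than the paper's implementation: a linear layer with a fixed per-worker binary `mask` over its weight, where backward masking keeps the dense forward computation but zeroes the gradients of parameters outside the worker's subnetwork, and forward masking removes those parameters from the forward pass as well. The `apply_mask` helper and `MaskedLinear` module are illustrative names introduced here, not identifiers from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def apply_mask(weight: torch.Tensor, mask: torch.Tensor, forward_masking: bool) -> torch.Tensor:
    """Return the effective weight for one worker's subnetwork (illustrative sketch).

    `mask` is a hypothetical binary tensor with the same shape as `weight`,
    e.g. zeroing the neurons or blocks assigned to other workers.
    """
    if forward_masking:
        # Forward masking: pruned parameters are absent from the forward pass,
        # so both activations and gradients are computed on the sparse weight.
        return weight * mask
    # Backward masking: the forward value equals the dense weight, but the
    # gradient only flows through the non-detached masked term, so this worker
    # updates only its own subnetwork while the forward computation stays exact.
    return weight * mask + (weight - weight * mask).detach()


class MaskedLinear(nn.Module):
    """Linear layer whose weight is masked per worker (sketch, not the paper's code)."""

    def __init__(self, in_features: int, out_features: int,
                 mask: torch.Tensor, forward_masking: bool = False):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        self.register_buffer("mask", mask)  # fixed per-worker binary mask
        self.forward_masking = forward_masking

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = apply_mask(self.linear.weight, self.mask, self.forward_masking)
        return F.linear(x, w, self.linear.bias)


# Example: this worker keeps the first half of the output neurons.
mask = torch.zeros(8, 16)
mask[:4] = 1.0
layer = MaskedLinear(16, 8, mask, forward_masking=False)
out = layer(torch.randn(2, 16))
out.sum().backward()
# With backward masking, layer.linear.weight.grad is zero for the rows
# this worker does not own, while the forward pass used the dense weight.
```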