The training efficiency and scalability of language models on massive clusters remain a critical bottleneck. Mainstream approaches such as ND parallelism are often cumbersome and complex, while flexible alternatives such as the Zero Redundancy Optimizer (ZeRO) are frequently hampered by communication overhead. In this paper, we propose Asynchronous Hierarchical Zero Parallelism (AsyncHZP), a novel asynchronous variant of ZeRO designed to achieve superior performance while maintaining simplicity and memory efficiency. Unlike traditional ZeRO, whose overly fine-grained sharding can lead to inefficient communication, AsyncHZP adaptively reshards parameters, gradients, and optimizer states across different replica groups. This strategy optimizes device memory utilization and significantly reduces communication overhead. In addition, we design a multi-stream asynchronous scheduling method that executes parameter all-gather and gradient reduce-scatter operations in dedicated background threads, effectively overlapping communication with computation while incurring negligible memory fragmentation. Empirical evaluations on both Dense and Mixture-of-Experts (MoE) models confirm that AsyncHZP remains stable at scale. It consistently outperforms classic ND parallelism, achieving state-of-the-art performance without complex strategy tuning, thereby simplifying the path to efficient large-scale training.
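To make the communication-computation overlap concrete, the sketch below illustrates the general idea of prefetching the next layer's sharded parameters with a non-blocking all-gather while the current layer computes. This is a minimal illustration, not the AsyncHZP implementation: it assumes PyTorch with an already-initialized `torch.distributed` NCCL process group, and the `prefetch_full_param` / `forward_with_overlap` helpers and the layer-call signature are hypothetical.

```python
# Minimal sketch: overlap parameter all-gather with per-layer forward compute,
# in the spirit of asynchronous scheduling of collectives described above.
# Assumes torch.distributed is initialized with an NCCL backend; the shard and
# layer structure here is purely illustrative.
import torch
import torch.distributed as dist


def prefetch_full_param(shard: torch.Tensor, world_size: int):
    """Launch a non-blocking all-gather of one parameter shard.

    Returns the gathered output buffer and the async work handle; NCCL runs
    the collective on its own internal stream, so compute on the default
    stream can proceed until handle.wait() is called.
    """
    full = torch.empty(shard.numel() * world_size,
                       device=shard.device, dtype=shard.dtype)
    handle = dist.all_gather_into_tensor(full, shard, async_op=True)
    return full, handle


def forward_with_overlap(layers, param_shards, x):
    """Forward pass that prefetches layer i+1's parameters while layer i computes."""
    world_size = dist.get_world_size()
    full, handle = prefetch_full_param(param_shards[0], world_size)
    for i, layer in enumerate(layers):
        handle.wait()                # wait only for the gather this layer needs
        current_full = full
        if i + 1 < len(layers):      # kick off the next gather before computing
            full, handle = prefetch_full_param(param_shards[i + 1], world_size)
        x = layer(x, current_full)   # hypothetical layer call using gathered params
    return x
```

The same pattern applies in reverse for gradients, where a non-blocking reduce-scatter of one layer's gradients can be issued while the backward pass of the previous layer proceeds.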