ACE-Sync: An Adaptive Cloud-Edge Synchronization Framework for Communication-Efficient Large-Scale Distributed Model Training

Large-scale deep learning models impose substantial communication overhead in distributed training, particularly in bandwidth-constrained or heterogeneous cloud-edge environments. Conventional synchronous or fixed-compression techniques often struggle to balance communication cost, convergence stability, and model accuracy. To address these challenges, we propose ACE-Sync, an Adaptive Cloud-Edge Synchronization Framework that integrates (1) an attention-based gradient importance predictor, (2) a differentiated parameter compression strategy, and (3) a hierarchical cloud-edge coordination mechanism. ACE-Sync dynamically selects which parameter groups to synchronize and determines an appropriate compression level for each under per-device bandwidth budgets. A knapsack-based optimization strategy maximizes the preservation of important gradients while reducing redundant communication. Furthermore, residual-based error compensation and device clustering ensure long-term convergence and cross-device personalization. Experiments show that ACE-Sync substantially reduces communication overhead while maintaining competitive accuracy. Compared with FullSync, ACE-Sync lowers communication cost from 112.5 GB to 44.7 GB (a 60% reduction) and shortens convergence from 41 to 39 epochs. Despite this aggressive communication reduction, ACE-Sync achieves 82.1% Top-1 accuracy, only 0.3% below the full-synchronization baseline. These results indicate that ACE-Sync provides a scalable, communication-efficient, and accuracy-preserving solution for large-scale cloud-edge distributed model training.
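To make the knapsack-based selection step concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not give the exact formulation, so the sketch assumes a multiple-choice knapsack: each parameter group gets exactly one compression level, the total transmitted size must fit a per-device budget, and the objective is the sum of importance scores weighted by an assumed fidelity per level. The function name `plan_sync`, the `LEVELS` table and its numbers, and the example importance scores (standing in for the output of the gradient importance predictor) are all hypothetical.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical compression levels (not from the paper):
# name -> (fraction of the full size that is sent,
#          fraction of gradient information assumed to survive).
LEVELS: Dict[str, Tuple[float, float]] = {
    "skip": (0.0, 0.0),
    "int8": (0.25, 0.9),
    "full": (1.0, 1.0),
}


def plan_sync(
    sizes_mb: List[int], importance: List[float], budget_mb: int
) -> Tuple[List[str], float]:
    """Pick one compression level per parameter group so that the total
    transmitted size stays within budget_mb and the sum of
    importance * fidelity is maximal (multiple-choice knapsack via DP)."""
    n = len(sizes_mb)
    # dp[i][b]: best value using the first i groups with at most b MB.
    dp = [[0.0] * (budget_mb + 1) for _ in range(n + 1)]
    choice = [["skip"] * (budget_mb + 1) for _ in range(n + 1)]

    for i in range(1, n + 1):
        for b in range(budget_mb + 1):
            best_val, best_name = -1.0, "skip"
            for name, (frac, fidelity) in LEVELS.items():
                cost = math.ceil(sizes_mb[i - 1] * frac)
                if cost <= b:  # "skip" costs 0, so one option always fits
                    val = dp[i - 1][b - cost] + importance[i - 1] * fidelity
                    if val > best_val:
                        best_val, best_name = val, name
            dp[i][b] = best_val
            choice[i][b] = best_name

    # Walk back through the table to recover the chosen level per group.
    plan = ["skip"] * n
    b = budget_mb
    for i in range(n, 0, -1):
        name = choice[i][b]
        plan[i - 1] = name
        b -= math.ceil(sizes_mb[i - 1] * LEVELS[name][0])
    return plan, dp[n][budget_mb]


if __name__ == "__main__":
    # Three parameter groups (sizes in MB) with made-up importance scores.
    plan, value = plan_sync([40, 40, 20], [0.6, 0.3, 0.1], budget_mb=50)
    print(plan, round(value, 3))
```

The dynamic program runs in time proportional to the number of groups times the budget times the number of levels, which is practical only when sizes are quantized to coarse units such as megabytes. A greedy value-per-byte heuristic would be a cheaper alternative under the same assumptions.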
