In data-parallel optimization of machine learning models, workers collaborate to improve their estimates of the model: more accurate gradients allow them to use larger learning rates and optimize faster. In the decentralized setting, in which workers communicate over a sparse graph, current theory fails to capture important aspects of real-world behavior. First, the `spectral gap' of the communication graph is not predictive of its empirical performance in (deep) learning. Second, current theory does not explain that collaboration enables larger learning rates than training alone. In fact, it prescribes smaller learning rates, which further decrease as graphs become larger, failing to explain convergence dynamics in infinite graphs. This paper aims to paint an accurate picture of sparsely-connected distributed optimization. We quantify how the graph topology influences convergence in a quadratic toy problem and provide theoretical results for general smooth and (strongly) convex objectives. Our theory matches empirical observations in deep learning, and accurately describes the relative merits of different graph topologies. This paper is an extension of the conference paper by Vogels et al. (2022). Code: https://github.com/epfml/topology-in-decentralized-learning.
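The following is a minimal, illustrative sketch (not the paper's implementation) of the setting described above: each worker keeps its own copy of the model, takes a local stochastic gradient step on a simple quadratic objective, and then averages its copy with its neighbors on a sparse graph (here a ring) via a gossip matrix. The worker count, learning rate, and noise level are arbitrary placeholders.

```python
# Decentralized SGD with gossip averaging on a quadratic toy problem (illustrative sketch).
import numpy as np

n_workers, dim, steps, lr, noise = 16, 10, 200, 0.1, 0.5
rng = np.random.default_rng(0)

# Gossip matrix W for a ring topology: each worker averages uniformly with
# itself and its two neighbors (symmetric and doubly stochastic).
W = np.zeros((n_workers, n_workers))
for i in range(n_workers):
    for j in (i - 1, i, i + 1):
        W[i, j % n_workers] = 1.0 / 3.0

x = rng.standard_normal((n_workers, dim))  # one model copy per worker

for t in range(steps):
    grad = x + noise * rng.standard_normal(x.shape)  # noisy gradient of 1/2 * ||x||^2
    x = W @ (x - lr * grad)                          # local step, then gossip averaging

print("mean distance to optimum:", np.linalg.norm(x, axis=1).mean())
```

Swapping the ring matrix for other doubly-stochastic gossip matrices (e.g. a torus or fully-connected averaging) changes how quickly the noise averages out across workers, which is the effect of topology that the paper studies.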