This paper is concerned with minimizing the average of $n$ cost functions over a network in which agents may communicate and exchange information with their neighbors. We consider the setting where only noisy gradient information is available. To solve the problem, we study the distributed stochastic gradient descent (DSGD) method and perform a non-asymptotic convergence analysis. For strongly convex and smooth objective functions, we show that DSGD asymptotically achieves the optimal network-independent convergence rate of centralized stochastic gradient descent (SGD). Our main contribution is to characterize the transient time needed for DSGD to approach this asymptotic convergence rate, which we show behaves as $K_T=\mathcal{O}\left(\frac{n}{(1-\rho_w)^2}\right)$, where $1-\rho_w$ denotes the spectral gap of the mixing matrix. Moreover, we construct a "hard" optimization problem for which the transient time needed for DSGD to approach the asymptotic convergence rate is lower bounded by $\Omega\left(\frac{n}{(1-\rho_w)^2}\right)$, implying the sharpness of the obtained result. Numerical experiments demonstrate the tightness of the theoretical results.
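For concreteness, a minimal sketch of the DSGD iteration under these assumptions is given below; the mixing weights $w_{ij}$, stepsizes $\alpha_k$, and stochastic gradients $g_i(\cdot;\xi)$ are standard notation assumed here rather than quantities defined in the abstract:
\[
x_i^{k+1} \;=\; \sum_{j=1}^{n} w_{ij}\, x_j^{k} \;-\; \alpha_k\, g_i\!\bigl(x_i^{k};\, \xi_i^{k}\bigr),
\qquad i = 1,\ldots,n,
\]
where $W=[w_{ij}]$ is the (doubly stochastic) mixing matrix whose spectral gap is $1-\rho_w$, and $g_i(x;\xi)$ is a noisy gradient of agent $i$'s local cost at $x$.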