In decentralized learning over distributed datasets, the data distributions across agents can differ significantly. Most current state-of-the-art decentralized algorithms assume the data distributions to be Independent and Identically Distributed (IID). This paper focuses on improving decentralized learning over non-IID data. We propose \textit{Neighborhood Gradient Clustering (NGC)}, a novel decentralized learning algorithm that modifies the local gradients of each agent using self- and cross-gradient information. For a pair of neighboring agents, the cross-gradients are the derivatives of the model parameters of one agent with respect to the dataset of the other. In particular, the proposed method replaces the local gradients of the model with the weighted mean of the self-gradients, model-variant cross-gradients (derivatives of the neighbors' parameters with respect to the local dataset), and data-variant cross-gradients (derivatives of the local model with respect to its neighbors' datasets). The data-variant cross-gradients are aggregated through an additional communication round without breaking the privacy constraints. Further, we present \textit{CompNGC}, a compressed version of \textit{NGC} that reduces the communication overhead by $32 \times$. We theoretically analyze the convergence rate of the proposed algorithm and demonstrate its efficiency on non-IID data sampled from various vision and language datasets. Our experiments demonstrate that \textit{NGC} and \textit{CompNGC} outperform the existing state-of-the-art decentralized learning algorithm over non-IID data by $0$-$6\%$ with significantly lower compute and memory requirements. Further, our experiments show that the model-variant cross-gradient information available locally at each agent can improve the performance over non-IID data by $1$-$35\%$ without additional communication cost.
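As a minimal sketch of the update rule described above (using illustrative notation and weights, not necessarily those of the paper), each agent $i$ with neighborhood $\mathcal{N}(i)$ replaces its local gradient with a weighted mean of self- and cross-gradients:
\begin{equation*}
\tilde{g}_i \;=\; w_{ii}\,\nabla_{x_i} F(x_i; D_i) \;+\; \sum_{j \in \mathcal{N}(i)} w^{m}_{ij}\,\nabla_{x_j} F(x_j; D_i) \;+\; \sum_{j \in \mathcal{N}(i)} w^{d}_{ij}\,\nabla_{x_i} F(x_i; D_j),
\end{equation*}
where $x_i$ and $D_i$ denote agent $i$'s model parameters and local dataset, and $F(x; D)$ is the loss of parameters $x$ on dataset $D$. The first term is the self-gradient, the second term collects the model-variant cross-gradients (computable locally from the neighbors' models already received in standard gossip averaging), the third term collects the data-variant cross-gradients (obtained through the additional communication round), and the mixing weights $w$ are assumed to sum to one.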