Decentralized learning over distributed datasets can have significantly different data distributions across the agents. The current state-of-the-art decentralized algorithms mostly assume the data distributions to be independent and identically distributed (IID). This paper focuses on improving decentralized learning over non-IID data. We propose \textit{Neighborhood Gradient Clustering (NGC)}, a novel decentralized learning algorithm that modifies the local gradients of each agent using self- and cross-gradient information. Cross-gradients for a pair of neighboring agents are the gradients of one agent's model parameters evaluated on the other agent's dataset. In particular, the proposed method replaces the local gradients of the model with the weighted mean of the self-gradients, the model-variant cross-gradients (derivatives of the neighbors' parameters with respect to the local dataset), and the data-variant cross-gradients (derivatives of the local model with respect to the neighbors' datasets). The data-variant cross-gradients are aggregated through an additional communication round without breaking the privacy constraints. Further, we present \textit{CompNGC}, a compressed version of \textit{NGC} that reduces the communication overhead by $32 \times$. We theoretically analyze the convergence rate of the proposed algorithm and demonstrate its efficiency over non-IID data sampled from various vision and language datasets. Our experiments demonstrate that \textit{NGC} and \textit{CompNGC} outperform the existing state-of-the-art decentralized learning algorithms over non-IID data by $0-6\%$ with significantly lower compute and memory requirements. Further, our experiments show that the model-variant cross-gradient information available locally at each agent can improve the performance over non-IID data by $1-35\%$ without any additional communication cost.
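The local update described above, replacing each agent's gradient with a weighted mean of its self-gradient, the model-variant cross-gradients, and the data-variant cross-gradients, can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation; the function name `ngc_local_gradient` and the mixing weights `alpha_self`, `alpha_mv`, and `alpha_dv` are hypothetical placeholders for whatever weighting scheme the algorithm prescribes.

```python
import numpy as np

def ngc_local_gradient(self_grad, model_variant_grads, data_variant_grads,
                       alpha_self=0.5, alpha_mv=0.25, alpha_dv=0.25):
    """Illustrative NGC-style local gradient update (hypothetical weights).

    self_grad:           gradient of the local model on the local dataset.
    model_variant_grads: list of neighbors' model gradients computed on the
                         local dataset (available locally, no extra comms).
    data_variant_grads:  list of the local model's gradients computed on the
                         neighbors' datasets (needs one extra comms round).
    Returns the weighted mean used in place of the plain local gradient.
    """
    mv = np.mean(model_variant_grads, axis=0)  # average over neighbors
    dv = np.mean(data_variant_grads, axis=0)
    return alpha_self * self_grad + alpha_mv * mv + alpha_dv * dv
```

Because the weights sum to one, the update reduces to the ordinary self-gradient when all cross-gradients agree with it, and otherwise pulls the local step toward the neighborhood consensus direction, which is the intuition behind the gains on non-IID data.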