Decentralized learning over distributed datasets can have significantly different data distributions across the agents. The current state-of-the-art decentralized algorithms mostly assume the data distributions to be independent and identically distributed (IID). This paper focuses on improving decentralized learning over non-IID data. We propose \textit{Neighborhood Gradient Clustering (NGC)}, a novel decentralized learning algorithm that modifies the local gradients of each agent using self- and cross-gradient information. Cross-gradients for a pair of neighboring agents are the derivatives of the model parameters of one agent with respect to the dataset of the other agent. In particular, the proposed method replaces the local gradients of the model with the weighted mean of the self-gradients, model-variant cross-gradients (derivatives of the neighbors' parameters with respect to the local dataset), and data-variant cross-gradients (derivatives of the local model with respect to its neighbors' datasets). The data-variant cross-gradients are aggregated through an additional communication round without breaking the privacy constraints. Further, we present \textit{CompNGC}, a compressed version of \textit{NGC} that reduces the communication overhead by $32 \times$. We demonstrate the efficiency of the proposed technique over non-IID data sampled from various vision and language datasets, trained on diverse models, graph sizes, and topologies. Our experiments demonstrate that \textit{NGC} and \textit{CompNGC} outperform (by $0$--$6\%$) the existing SoTA decentralized learning algorithms over non-IID data with significantly lower compute and memory requirements. Further, our experiments show that the model-variant cross-gradient information available locally at each agent can improve the performance over non-IID data by $1$--$35\%$ without additional communication cost.
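The local-gradient replacement described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the function name `ngc_gradient` and the single mixing weight `alpha` (split evenly between the two cross-gradient groups) are assumptions for exposition; the paper's actual weighting scheme may differ.

```python
import numpy as np

def ngc_gradient(self_grad, model_variant_grads, data_variant_grads, alpha=0.5):
    """Sketch of the NGC update: replace an agent's local gradient with a
    weighted mean of its self-gradient, the model-variant cross-gradients
    (neighbors' models evaluated on the local dataset), and the data-variant
    cross-gradients (local model evaluated on neighbors' datasets).

    `alpha` is a hypothetical mixing weight, not the paper's exact scheme.
    """
    n = len(model_variant_grads)
    # Average over the neighborhood for each cross-gradient group.
    mv = sum(model_variant_grads) / n  # derivatives of neighbors' params w.r.t. local data
    dv = sum(data_variant_grads) / n   # derivatives of local model w.r.t. neighbors' data
    return (1 - alpha) * self_grad + (alpha / 2) * mv + (alpha / 2) * dv
```

Note that the data-variant cross-gradients require each neighbor to evaluate the agent's model on its own data and send back only the resulting gradient, which is the extra communication round mentioned above; raw data never leaves an agent.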