Decentralized distributed learning is the key to enabling large-scale machine learning (training) on edge devices utilizing private user-generated local data, without relying on the cloud. However, practical realization of such on-device training is limited by the communication bottleneck, the computational complexity of training deep models, and significant data-distribution skew across devices. Many feedback-based compression techniques have been proposed in the literature to reduce the communication cost, and a few works propose algorithmic changes that aid performance in the presence of skewed data distributions by improving the convergence rate. To the best of our knowledge, there is no work in the literature that applies and demonstrates compute-efficient training techniques such as quantization and pruning for peer-to-peer decentralized learning setups. In this paper, we analyze and show the convergence of low-precision decentralized training, which aims to reduce the computational complexity of training and inference. Further, we study the effect of the degree of data skew and of communication compression on low-precision decentralized training over various computer vision and Natural Language Processing (NLP) tasks. Our experiments indicate that 8-bit decentralized training has minimal accuracy loss compared to its full-precision counterpart, even with heterogeneous data. However, when low-precision training is combined with communication compression through sparsification, we observe a 1-2% drop in accuracy. The proposed low-precision decentralized training decreases computational complexity, memory usage, and communication cost by ~4x while trading off less than 1% accuracy for both IID and non-IID data. In particular, with higher skew values, we observe an increase in accuracy (by ~0.5%) with low-precision training, indicating the regularization effect of quantization.
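To make the two mechanisms mentioned above concrete, the following is a minimal, self-contained sketch (not the paper's implementation) of one gossip round in which each node sends an 8-bit quantized and optionally top-k sparsified copy of its parameters to its ring neighbors; all function names (quantize_int8, sparsify_topk, gossip_average) and the mixing/compression settings are illustrative assumptions, not constructs from the paper.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor 8-bit quantization; returns int8 values and a scale (assumed scheme)."""
    scale = np.max(np.abs(x)) / 127.0 + 1e-12
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

def sparsify_topk(x, ratio=0.25):
    """Keep only the largest-magnitude entries (communication compression via sparsification)."""
    k = max(1, int(ratio * x.size))
    idx = np.argpartition(np.abs(x.ravel()), -k)[-k:]
    sparse = np.zeros_like(x.ravel())
    sparse[idx] = x.ravel()[idx]
    return sparse.reshape(x.shape)

def gossip_average(params, mixing_weight=0.5, compress=True):
    """One gossip round on a ring: each node mixes its parameters with the average
    of its neighbors' low-precision (and optionally sparsified) messages."""
    n = len(params)
    messages = []
    for p in params:
        q, s = quantize_int8(p)          # low-precision message
        m = dequantize(q, s)
        if compress:
            m = sparsify_topk(m)         # sparsified communication
        messages.append(m)
    new_params = []
    for i in range(n):
        neighbor_avg = 0.5 * (messages[(i - 1) % n] + messages[(i + 1) % n])
        new_params.append((1 - mixing_weight) * params[i] + mixing_weight * neighbor_avg)
    return new_params

# Toy usage: 4 nodes, each holding a random "model" vector.
rng = np.random.default_rng(0)
params = [rng.standard_normal(1000).astype(np.float32) for _ in range(4)]
params = gossip_average(params)
```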