Current deep learning (DL) systems rely on a centralized computing paradigm, which limits the amount of available training data, increases system latency, and adds privacy and security constraints. On-device learning, enabled by decentralized and distributed training of DL models over peer-to-peer, wirelessly connected edge devices, not only alleviates these limitations but also enables next-generation applications that require DL models to continuously interact with and learn from their environment. However, this necessitates novel training algorithms that train DL models over time-varying and directed peer-to-peer graph structures while minimizing the communication between devices and remaining resilient to non-IID data distributions. In this work, we propose Sparse-Push, a communication-efficient decentralized distributed training algorithm that supports training over peer-to-peer, directed, and time-varying graph topologies. The proposed algorithm enables a 466x reduction in communication with only 1% degradation in performance when training various DL models, such as ResNet-20 and VGG11, on the CIFAR-10 dataset. Further, we demonstrate how communication compression can lead to significant performance degradation in the case of non-IID datasets, and we propose the Skew-Compensated Sparse-Push algorithm, which recovers this performance drop while maintaining similar levels of communication compression.
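To make the core idea concrete, the sketch below illustrates push-sum gossip over a directed graph with top-k sparsification of the pushed messages, which is the general family of techniques the abstract describes. This is a minimal toy illustration, not the paper's implementation: the function names (top_k, push_sum_round), the uniform mixing weights, and the ring topology are assumptions, and local gradient updates and any error-compensation or skew-compensation machinery are omitted.

```python
# Illustrative sketch (assumed, not the paper's exact algorithm): push-sum
# averaging over a directed graph where each pushed vector is top-k sparsified
# to reduce communication.
import numpy as np

def top_k(vec, k):
    """Keep the k largest-magnitude entries of vec, zero out the rest."""
    out = np.zeros_like(vec)
    idx = np.argpartition(np.abs(vec), -k)[-k:]
    out[idx] = vec[idx]
    return out

def push_sum_round(x, w, out_neighbors, k):
    """One gossip round: each node i pushes a sparsified share of its value x_i
    and a (dense, scalar) share of its push-sum weight w_i to its out-neighbors
    and itself, and every node sums what it receives."""
    n = len(x)
    new_x = [np.zeros_like(x[i]) for i in range(n)]
    new_w = np.zeros(n)
    for i in range(n):
        targets = out_neighbors[i] + [i]        # push to out-neighbors and self
        share = 1.0 / len(targets)              # column-stochastic mixing weights
        for j in targets:
            new_x[j] += top_k(share * x[i], k)  # sparsify the pushed vector
            new_w[j] += share * w[i]            # weights stay dense (scalars)
    return new_x, new_w

# Toy usage: 4 nodes on a directed ring, approximately averaging local vectors.
rng = np.random.default_rng(0)
x = [rng.standard_normal(10) for _ in range(4)]
w = np.ones(4)
ring = {0: [1], 1: [2], 2: [3], 3: [0]}
for _ in range(50):
    x, w = push_sum_round(x, w, ring, k=5)
estimates = [x[i] / w[i] for i in range(4)]     # de-biased push-sum estimates
```

In this toy setting, dividing by the push-sum weight w_i corrects the bias that directed (non-doubly-stochastic) mixing would otherwise introduce; sparsifying the pushed vectors trades some of that exactness for lower communication, which is the trade-off the abstract quantifies.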