More than 70% of cloud computing is paid for but sits idle. A large fraction of this idle compute consists of cheap CPUs with few cores that go unused during off-peak hours. This paper aims to enable those CPU cycles to train heavyweight AI models. Our goal runs counter to mainstream frameworks, which focus on leveraging expensive, specialized, ultra-high-bandwidth interconnects to address the communication bottleneck in distributed neural network training. This paper presents a distributed model-parallel training framework that enables training large neural networks on small CPU clusters with low Internet bandwidth. We build upon the adaptive sparse training framework introduced by the SLIDE algorithm. By carefully deploying sparsity over distributed nodes, we demonstrate model-parallel training that is several orders of magnitude faster than Horovod, the main engine behind most commercial software. We show that, with the communication reduction afforded by sparsity, we can train a model with close to a billion parameters on simple 4-16 core CPU nodes connected by a basic low-bandwidth interconnect. Moreover, the training time is on par with some of the best hardware accelerators.