Deep learning has been applied to a wide range of areas and achieved major breakthroughs. With ever-increasing model sizes and training data volumes, distributed deep learning has emerged, which uses a cluster of machines to train a model in parallel. Unfortunately, performance often falls far short of linear speedup due to the communication overhead between cluster nodes. To address this challenge, this paper designs and implements Libra, an in-network aggregator that leverages in-network computation to optimize communication for distributed DL training in two ways: 1) reducing the number of active connections and 2) aggregating exchanged network packets. We implemented Libra on Intel Tofino switches, customized a lightweight host stack, and integrated it into the open-source training framework PS-lite. Experimental results show that Libra achieves a 1.5x to 4x speedup.
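To make the packet-aggregation idea concrete, below is a minimal, purely illustrative Python sketch (not Libra's actual Tofino/P4 data-plane code); the worker count, per-packet slot size, and packet format are assumptions. It shows the core behavior: the switch-side aggregator sums matching gradient fragments from all workers and emits a single aggregated packet, so each worker exchanges one stream with the switch instead of one connection per peer.

```python
from collections import defaultdict

NUM_WORKERS = 4   # assumed cluster size (hypothetical)
SLOT_SIZE = 8     # gradient values carried per packet (hypothetical)

class SwitchAggregator:
    """Conceptual model of switch-side gradient aggregation."""
    def __init__(self, num_workers):
        self.num_workers = num_workers
        self.partial = defaultdict(lambda: [0.0] * SLOT_SIZE)  # slot -> running sum
        self.seen = defaultdict(set)                            # slot -> worker ids received

    def on_packet(self, worker_id, slot, values):
        """Accumulate one worker's fragment; return the aggregate once all workers arrive."""
        acc = self.partial[slot]
        for i, v in enumerate(values):
            acc[i] += v
        self.seen[slot].add(worker_id)
        if len(self.seen[slot]) == self.num_workers:
            result = self.partial.pop(slot)
            del self.seen[slot]
            return result  # single aggregated packet sent back to workers
        return None  # still waiting on other workers' fragments

# Usage: each worker sends its fragment for slot 0; only the last arrival
# triggers the aggregated response.
agg = SwitchAggregator(NUM_WORKERS)
for w in range(NUM_WORKERS):
    out = agg.on_packet(w, slot=0, values=[float(w)] * SLOT_SIZE)
print(out)  # [6.0, 6.0, ...] = 0+1+2+3 summed element-wise
```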