Distributed sparse deep learning is widely used in many Internet-scale applications, and network communication is one of the major bottlenecks for training performance. In-network gradient aggregation on programmable switches is a promising way to speed up training. However, existing in-network aggregation solutions are designed for distributed dense deep training and fall short when applied to sparse deep training. To address this gap, we present Libra, which builds on our key observation that parameter update frequencies in distributed sparse deep training are extremely skewed. Specifically, Libra offloads onto programmable switches only the aggregation of "hot" parameters that are updated frequently. To enable this offloading and achieve high aggregation throughput, we propose solutions to the challenges of hot-parameter identification, parameter orchestration, floating-point summation on switches, and system reliability. We implemented Libra on Intel Tofino switches and integrated it with PS-lite. Finally, we evaluate Libra's performance through extensive experiments and show that Libra speeds up gradient aggregation by 1.5 to 4 times.
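To illustrate the core idea, the following is a minimal Python sketch of partitioning parameters by observed update frequency and routing only the "hot" ones toward an in-network aggregator. The names `switch_aggregator` and `param_server` are hypothetical stand-ins, not Libra's actual API; the frequency threshold is likewise an assumption made for illustration.

```python
from collections import Counter

def partition_parameters(update_log, hot_fraction=0.01):
    """Split parameter IDs into 'hot' (frequently updated) and 'cold' sets.

    update_log: iterable of parameter IDs touched by sparse gradient updates
    (e.g. embedding rows accessed over recent iterations).
    hot_fraction: assumed fraction of distinct parameters to treat as hot.
    """
    freq = Counter(update_log)
    ranked = [pid for pid, _ in freq.most_common()]
    cutoff = max(1, int(len(ranked) * hot_fraction))
    return set(ranked[:cutoff]), set(ranked[cutoff:])

def route_gradient(param_id, grad, hot_set, switch_aggregator, param_server):
    """Send hot-parameter gradients to the in-network aggregator; send the
    long tail of cold parameters to an ordinary parameter server."""
    if param_id in hot_set:
        switch_aggregator.push(param_id, grad)   # aggregated on the switch
    else:
        param_server.push(param_id, grad)        # aggregated at the server
```

The sketch only captures the hot/cold split and routing decision; the paper's actual mechanisms for on-switch floating-point summation and fault handling are beyond this example.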