As the scale of distributed training grows, communication becomes a bottleneck. To accelerate communication, recent work introduces In-Network Aggregation (INA), which moves gradient summation into network middle-boxes, e.g., programmable switches, to reduce traffic volume. However, switch memory is scarce compared to the volume of gradients transmitted in distributed training. Although prior work applies methods such as pool-based streaming or dynamic sharing to tackle this mismatch, switch memory remains a potential performance bottleneck. Furthermore, we observe that recent designs under-utilize switch memory because aggregator deallocation requires synchronization. To improve switch memory utilization, we propose ESA, an $\underline{E}$fficient Switch Memory $\underline{S}$cheduler for In-Network $\underline{A}$ggregation. At its core, ESA enforces a preemptive aggregator allocation primitive and introduces priority scheduling at the data plane, which improves switch memory utilization and average job completion time (JCT). Experiments show that ESA can improve the average JCT by up to $1.35\times$.
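The abstract names the mechanism but not its data-plane logic, so the following is only a minimal Python sketch of what preemptive aggregator allocation with priorities could look like. Everything in it is an assumption made for illustration, not ESA's actual design: the `Packet` and `Slot` fields, the hash in `_index`, the convention that a larger `priority` is more urgent, and the choice to emit an evicted partial sum (or a losing packet) as a fallback event for some end host to finish.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Packet:
    job: int        # training job id (hypothetical field)
    seq: int        # index of the gradient chunk within the job
    priority: int   # larger value = more urgent (assumed convention)
    value: float    # stand-in for a gradient fragment
    fan_in: int     # number of workers contributing to this chunk


@dataclass
class Slot:
    job: int
    seq: int
    priority: int
    total: float
    count: int
    fan_in: int


Event = Tuple[str, int, int, float, int]  # (kind, job, seq, value, count)


class AggregatorPool:
    """Toy model of a fixed pool of switch aggregators."""

    def __init__(self, size: int) -> None:
        self.slots: List[Optional[Slot]] = [None] * size

    def _index(self, pkt: Packet) -> int:
        # Hypothetical hash from (job, seq) to an aggregator index.
        return (pkt.job * 31 + pkt.seq) % len(self.slots)

    def _maybe_complete(self, i: int) -> List[Event]:
        slot = self.slots[i]
        if slot is not None and slot.count == slot.fan_in:
            self.slots[i] = None  # release the aggregator
            return [("complete", slot.job, slot.seq, slot.total, slot.count)]
        return []

    def receive(self, pkt: Packet) -> List[Event]:
        """Process one packet and return the events the switch would emit."""
        i = self._index(pkt)
        slot = self.slots[i]
        fresh = Slot(pkt.job, pkt.seq, pkt.priority, pkt.value, 1, pkt.fan_in)
        if slot is None:
            # Free aggregator: allocate it to this chunk.
            self.slots[i] = fresh
            return self._maybe_complete(i)
        if (slot.job, slot.seq) == (pkt.job, pkt.seq):
            # Same chunk: accumulate into the existing aggregator.
            slot.total += pkt.value
            slot.count += 1
            return self._maybe_complete(i)
        if pkt.priority > slot.priority:
            # Collision, incoming packet wins: evict the partial sum
            # instead of waiting for the occupant to finish.
            evicted: Event = ("partial", slot.job, slot.seq, slot.total, slot.count)
            self.slots[i] = fresh
            return [evicted] + self._maybe_complete(i)
        # Collision, occupant wins: forward the packet unaggregated.
        return [("passthrough", pkt.job, pkt.seq, pkt.value, 1)]
```

With a pool of size one, every chunk collides, which makes the preemption path easy to exercise: a low-priority packet occupies the slot, a higher-priority packet from another job evicts its partial sum, and later low-priority packets pass through unaggregated until the slot is released.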