Cloud datacenters are growing exponentially in both number and size. As communication protocols evolve, datacenter networks experience higher utilization, leading to greater congestion along with increased latency and packet loss. We analyze a recently published reinforcement learning congestion control (CC) algorithm (Tessler et al. 2022) that achieves state-of-the-art performance and, in a second phase, reshape it to comply with current hardware limitations. We show how to map complex policies to a low-compute architecture, achieving a 500$\times$ reduction in inference latency. This transformation enables real-time policy inference within the $\mu$sec decision-time requirement, with a negligible effect on the quality of the policy. We deploy the transformed policy onto NVIDIA NICs in an operational network. Compared to popular CC algorithms used in production, we show that RL-CC is the only one that performs well on all benchmarks tested, balancing multiple metrics simultaneously: bandwidth, latency, and packet drops. This sheds light on the feasibility of data-driven methods for congestion control, challenging the prior belief that handcrafted heuristics are required to obtain a stable and fair solution.