We approach the task of network congestion control in datacenters using Reinforcement Learning (RL). Successful congestion control algorithms can dramatically improve latency and overall network throughput. To date, no learning-based algorithms have shown practical potential in this domain. Indeed, the most widely deployed recent solutions rely on rule-based heuristics that are tuned against a predetermined set of benchmarks. Consequently, these heuristics do not generalize well to previously unseen scenarios. In contrast, we devise an RL-based algorithm with the aim of generalizing to different configurations of real-world datacenter networks. We overcome challenges such as partial observability, non-stationarity, and multi-objective optimization. We further propose a policy gradient algorithm that leverages the analytical structure of the reward function to approximate its derivative and improve stability. We show that this scheme outperforms popular alternative RL approaches, and generalizes to scenarios that were not seen during training. Our experiments, conducted on a realistic simulator that emulates the behavior of communication networks, exhibit improved performance on all considered metrics simultaneously, compared to the popular algorithms deployed in real datacenters today. Our algorithm is being productized to replace heuristics in some of the largest datacenters in the world.
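To illustrate the idea of exploiting the reward's analytical structure, the following is a minimal sketch, not the paper's actual algorithm: when the reward is a known differentiable function of the action, the policy can be updated with the exact reward derivative via the chain rule, rather than a high-variance score-function estimate. The reward shape, the linear policy, and the state distribution below are all illustrative assumptions.

```python
import numpy as np

# Hypothetical analytic reward: peak at action a = 1 (e.g. a normalized
# congestion signal held at its target). Because dr/da is known in closed
# form, no sampling-based gradient estimate of the reward is needed.
def reward(a):
    return -(a - 1.0) ** 2

def reward_grad(a):
    return -2.0 * (a - 1.0)        # analytic derivative dr/da

def policy(theta, s):
    return theta * s               # deterministic linear policy a = theta * s

def train(theta=0.0, lr=0.05, steps=500):
    rng = np.random.default_rng(0)
    for _ in range(steps):
        s = rng.uniform(0.5, 1.5)  # observed state (e.g. RTT inflation)
        a = policy(theta, s)
        # chain rule: d r / d theta = (dr/da) * (da/dtheta)
        theta += lr * reward_grad(a) * s
    return theta

theta = train()
# theta settles near the value that maximizes expected reward over states
```

Because each update uses the true derivative of the reward at the chosen action, the gradient estimate has far lower variance than a REINFORCE-style score-function estimator, which is the stability benefit the abstract alludes to.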