The past decade has witnessed a rapid expansion of global cloud wide-area networks (WANs) with the deployment of new network sites and datacenters, making it challenging for commercial optimization engines to solve the network traffic engineering (TE) problem quickly at scale. Current approaches to accelerating TE optimization decompose the task into subproblems that can be solved in parallel using optimization solvers, but they are fundamentally restricted to a few dozen subproblems in order to balance run time and TE performance, and therefore achieve limited parallelism and speedup. Motivated by the ability to readily access thousands of threads on GPUs through modern deep learning frameworks, we propose Teal, a learning-based TE algorithm that harnesses the parallel processing power of GPUs to accelerate TE control. First, Teal designs a flow-centric graph neural network (GNN) to capture WAN connectivity and model network flows, learning flow features as inputs to the downstream allocation. Second, to reduce the problem scale and make learning tractable, Teal employs a multi-agent reinforcement learning (RL) algorithm to allocate each traffic demand independently while optimizing a central TE objective. Finally, Teal fine-tunes the resulting flow allocations using the alternating direction method of multipliers (ADMM), a highly parallelizable constrained optimization algorithm, to reduce constraint violations (e.g., overused links). We evaluate Teal on traffic matrices collected from a global cloud provider, and show that on a large WAN topology with over 1,700 nodes, Teal generates near-optimal flow allocations while being several orders of magnitude faster than the production optimization engine. Compared with other TE acceleration schemes, Teal satisfies up to 29% more traffic demands and yields up to 109x speedups.
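To make the notion of a constraint violation concrete, the following minimal sketch shows how independently chosen per-demand allocations can overuse a shared link, and how a simple proportional scaling restores feasibility. This is an illustration only: the proportional repair is a stand-in and is not the ADMM procedure that Teal uses, and the topology, demands, split ratios, and function names (`link_loads`, `repair_overuse`) are all hypothetical.

```python
def link_loads(path_flows, path_links):
    """Sum the flow that all paths place on each link they traverse."""
    loads = {}
    for path, flow in path_flows.items():
        for link in path_links[path]:
            loads[link] = loads.get(link, 0.0) + flow
    return loads


def repair_overuse(path_flows, path_links, capacity):
    """Scale each path by the tightest capacity/load ratio along its links.

    A simple stand-in for a feasibility repair (not ADMM): every link ends up
    at or below capacity, possibly at the cost of unsatisfied demand.
    """
    loads = link_loads(path_flows, path_links)
    repaired = {}
    for path, flow in path_flows.items():
        scale = min(
            min(1.0, capacity[link] / loads[link]) if loads[link] > 0 else 1.0
            for link in path_links[path]
        )
        repaired[path] = flow * scale
    return repaired


# Hypothetical toy instance: two demands, three candidate paths, three links.
demands = {"A": 10.0, "B": 6.0}
split = {"A": {"p1": 0.5, "p2": 0.5}, "B": {"p3": 1.0}}  # per-demand split ratios
path_links = {"p1": ["e1"], "p2": ["e2", "e3"], "p3": ["e1"]}
capacity = {"e1": 8.0, "e2": 10.0, "e3": 10.0}

# Each demand is allocated independently, so link e1 may be overused.
flows = {p: demands[d] * r for d, paths in split.items() for p, r in paths.items()}
repaired = repair_overuse(flows, path_links, capacity)

print("loads before repair:", link_loads(flows, path_links))
print("loads after repair: ", link_loads(repaired, path_links))
print("total satisfied demand:", sum(repaired.values()))
```

In this toy instance, demands A and B both place flow on link e1, so the independent allocations exceed its capacity until the flows through e1 are scaled down. The total flow that remains after the repair plays the role of a central TE objective such as satisfied demand.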