Network routing is a distributed decision problem which naturally admits numerical performance measures, such as the average time for a packet to travel from source to destination. OLPOMDP, a policy-gradient reinforcement learning algorithm, was successfully applied to simulated network routing under a number of network models. Multiple distributed agents (routers) learned co-operative behavior without explicit inter-agent communication, and they avoided behavior which was individually desirable, but detrimental to the group's overall performance. Furthermore, shaping the reward signal by explicitly penalizing certain patterns of sub-optimal behavior was found to dramatically improve the convergence rate.
翻译:网络路由是一个分布式决策问题,天然适用于数值化性能度量,例如数据包从源到目的地的平均传输时间。OLPOMDP作为一种策略梯度强化学习算法,在多种网络模型下成功应用于模拟网络路由。多个分布式智能体(路由器)在没有显式智能体间通信的情况下学会了协作行为,并避免了那些对个体有利但损害整体性能的行为。此外,通过显式惩罚某些次优行为模式来塑造奖励信号,被发现能显著提高收敛速度。