Reinforcement Learning (RL) is currently one of the most commonly used techniques for traffic signal control (TSC), as it can adaptively adjust traffic signal phases and durations according to real-time traffic data. However, a fully centralized RL approach runs into difficulties in a multi-intersection network because the state-action space grows exponentially with the number of intersections. Multi-agent reinforcement learning (MARL) can overcome this high-dimensionality problem by distributing the global control task among local RL agents, but it introduces new challenges, such as failure to converge caused by the non-stationarity of the Markov Decision Process (MDP). In this paper, we introduce an off-policy Nash deep Q-Network (OPNDQN) algorithm, which mitigates the weaknesses of both the fully centralized and the MARL approaches. OPNDQN addresses the problem that traditional algorithms cannot handle traffic models with large state-action spaces by using a fictitious-play approach at each iteration to find a Nash equilibrium among neighboring intersections, from which no intersection has an incentive to deviate unilaterally. One of the main advantages of OPNDQN is that it mitigates the non-stationarity of the multi-agent Markov process, because it accounts for the mutual influence among neighboring intersections by sharing their actions. On the other hand, when training a large traffic network, OPNDQN converges faster than existing MARL approaches because it does not incorporate the full state information of every agent. We conduct extensive experiments using the Simulation of Urban MObility (SUMO) simulator and show the clear superiority of OPNDQN over several existing MARL approaches in terms of average queue length, episode training reward and average waiting time.
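To make the fictitious-play step described above concrete, the following is a minimal sketch, not the authors' implementation: each intersection repeatedly best-responds to the current actions of its neighbors until no agent wants to change its action, i.e. a Nash equilibrium of the one-step game. The callable q_values(i, state, neighbor_actions) is a hypothetical stand-in for agent i's Q-network, which is assumed here to take the neighbors' shared actions as part of its input.

```python
import numpy as np

def fictitious_play_step(states, neighbors, q_values, max_iters=50):
    """Sketch of a fictitious-play best-response loop among neighboring intersections.

    states     : list of per-intersection observations
    neighbors  : neighbors[i] is an index array of intersections adjacent to i
    q_values   : hypothetical callable (i, state_i, neighbor_actions) -> array of phase Q-values
    Returns a joint action from which no single intersection improves by deviating.
    """
    n_agents = len(states)
    joint_action = np.zeros(n_agents, dtype=int)            # arbitrary initial joint action
    for _ in range(max_iters):
        changed = False
        for i in range(n_agents):
            neigh_acts = joint_action[neighbors[i]]         # actions shared by neighboring intersections
            q_i = q_values(i, states[i], neigh_acts)        # Q-value of each signal phase for agent i
            best = int(np.argmax(q_i))                      # best response to the neighbors' actions
            if best != joint_action[i]:
                joint_action[i] = best
                changed = True
        if not changed:                                      # no agent has an incentive to deviate
            break
    return joint_action
```

In this sketch each agent only needs its own observation plus its neighbors' actions, which reflects the abstract's claim that OPNDQN avoids feeding the full joint state of all agents into every update.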