Traffic signal control is an important problem in urban mobility with significant potential for economic and environmental impact. While there is growing interest in Reinforcement Learning (RL) for traffic signal control, work so far has focused on learning through simulations, which can introduce inaccuracies due to simplifying assumptions. In contrast, real experience data on traffic is readily available and could be exploited at minimal cost, and recent progress in offline (or batch) RL has made this possible. Model-based offline RL methods, in particular, have been shown to generalize from experience data much better than other approaches. We build a model-based learning framework that infers a Markov Decision Process (MDP) from a dataset collected under a cyclic traffic signal control policy, a kind of data that is both commonplace and easy to gather. The MDP is constructed with pessimistic costs to manage out-of-distribution scenarios, using an adaptive reward shaping that is shown to provide better regularization than prior related work while also being PAC-optimal. Our model is evaluated on a complex signalized roundabout, showing that highly performant traffic control policies can be built in a data-efficient manner.
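To make the pessimistic-MDP idea concrete, the following is a minimal, hypothetical sketch (not the paper's exact formulation): a tabular MDP is estimated from logged transitions, and rewards are shaped pessimistically so that rarely visited state-action pairs are penalized before planning. The penalty form `kappa / sqrt(count)`, the fallback self-loop for unseen pairs, and all names here are illustrative assumptions.

```python
import numpy as np


def build_pessimistic_mdp(transitions, n_states, n_actions, kappa=1.0):
    """transitions: list of (s, a, r, s_next) tuples logged under a fixed (e.g. cyclic) policy."""
    counts = np.zeros((n_states, n_actions))
    reward_sum = np.zeros((n_states, n_actions))
    next_counts = np.zeros((n_states, n_actions, n_states))
    for s, a, r, s_next in transitions:
        counts[s, a] += 1
        reward_sum[s, a] += r
        next_counts[s, a, s_next] += 1

    # Empirical transition model and mean reward; unseen pairs fall back to
    # a self-loop with a flat pessimistic penalty.
    P = np.zeros((n_states, n_actions, n_states))
    R = np.zeros((n_states, n_actions))
    for s in range(n_states):
        for a in range(n_actions):
            if counts[s, a] > 0:
                P[s, a] = next_counts[s, a] / counts[s, a]
                # Adaptive pessimism: the penalty shrinks as data coverage grows.
                R[s, a] = reward_sum[s, a] / counts[s, a] - kappa / np.sqrt(counts[s, a])
            else:
                P[s, a, s] = 1.0
                R[s, a] = -kappa
    return P, R


def value_iteration(P, R, gamma=0.99, tol=1e-6):
    """Plan in the pessimistic MDP; returns the greedy policy."""
    n_states, _, _ = P.shape
    V = np.zeros(n_states)
    while True:
        Q = R + gamma * (P @ V)          # shape: (n_states, n_actions)
        V_new = Q.max(axis=1)
        if np.abs(V_new - V).max() < tol:
            break
        V = V_new
    return Q.argmax(axis=1)


if __name__ == "__main__":
    # Toy usage with random logged data (a stand-in for real traffic logs).
    rng = np.random.default_rng(0)
    logged = [(int(rng.integers(5)), int(rng.integers(2)),
               float(rng.normal()), int(rng.integers(5))) for _ in range(500)]
    P, R = build_pessimistic_mdp(logged, n_states=5, n_actions=2)
    print("greedy policy per state:", value_iteration(P, R))
```

In this sketch the pessimism term plays the role of the adaptive reward shaping described above: state-action pairs that are well covered by the logged cyclic-policy data keep their empirical reward, while poorly covered ones are discouraged, steering the planner away from out-of-distribution behavior.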