Traffic signal control is an important problem in urban mobility with significant potential for economic and environmental impact. While there is growing interest in Reinforcement Learning (RL) for traffic signal control, the work so far has focused on learning through simulations, which can lead to inaccuracies due to simplifying assumptions. Instead, real experience data on traffic is readily available and could be exploited at minimal cost. Recent progress in {\em offline} or {\em batch} RL has enabled just that. Model-based offline RL methods, in particular, have been shown to generalize from experience data much better than other approaches. We build a model-based learning framework that infers a Markov Decision Process (MDP) from a dataset collected under a cyclic traffic signal control policy, a type of data that is both commonplace and easy to gather. The MDP incorporates pessimistic costs to manage out-of-distribution scenarios through adaptive reward shaping, which is shown to provide better regularization than prior related work while also being PAC-optimal. Our model is evaluated on a complex signalized roundabout, showing that it is possible to build highly performant traffic control policies in a data-efficient manner.
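To make the idea of pessimistic costs for out-of-distribution scenarios concrete, below is a minimal, hypothetical sketch of one common way such penalties are computed in model-based offline RL: the reward predicted by a learned dynamics model is reduced in proportion to the disagreement of an ensemble of dynamics models. This is a generic illustration under assumed names (\texttt{ensemble\_disagreement}, \texttt{pessimistic\_reward}, \texttt{penalty\_coeff}), not the adaptive reward-shaping scheme proposed in this work.

\begin{verbatim}
import numpy as np

def ensemble_disagreement(next_state_preds: np.ndarray) -> float:
    """Uncertainty proxy: largest L2 deviation of any ensemble member's
    next-state prediction from the ensemble mean.
    next_state_preds has shape [n_models, state_dim]."""
    mean_pred = next_state_preds.mean(axis=0)
    return float(np.linalg.norm(next_state_preds - mean_pred, axis=1).max())

def pessimistic_reward(model_reward: float,
                       next_state_preds: np.ndarray,
                       penalty_coeff: float = 1.0) -> float:
    """Shaped reward r(s, a) - lambda * u(s, a) used when rolling out the
    learned MDP; penalty_coeff (lambda) trades return against pessimism on
    out-of-distribution state-action pairs."""
    return model_reward - penalty_coeff * ensemble_disagreement(next_state_preds)

# Toy usage: three dynamics models agree closely, so the penalty is small.
preds = np.array([[0.10, 0.52], [0.11, 0.50], [0.09, 0.51]])
print(pessimistic_reward(model_reward=1.0, next_state_preds=preds))
\end{verbatim}

In such schemes the policy is then trained purely on rollouts of the penalized model, so states poorly covered by the offline dataset are implicitly avoided.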