Traffic signal control is an important problem in urban mobility, with significant potential for economic and environmental impact. While there is growing interest in Reinforcement Learning (RL) for traffic control, the work so far has focused on learning through online interaction, which is costly in practice. Real traffic experience data, by contrast, is readily available and can be exploited at minimal cost. Recent progress in offline (batch) RL makes exactly this possible. Model-based offline RL methods, in particular, have been shown to generalize from the experience data much better than other approaches. We build a model-based learning framework, A-DAC, which infers a Markov Decision Process (MDP) from the dataset with pessimistic costs built in to handle data uncertainty. These costs are modeled through an adaptive shaping of the rewards in the MDP, which provides better regularization of the data than prior related work. A-DAC is evaluated on a complex signalized roundabout using multiple datasets that vary in size and in batch collection policy. The evaluation results show that it is possible to build high-performance control policies in a data-efficient manner using simplistic batch collection policies.
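To make the core idea of pessimistic model-based offline RL concrete, the sketch below estimates a tabular MDP from a batch of logged transitions and subtracts a count-based uncertainty penalty from the empirical rewards before planning with value iteration. This is only an illustrative, minimal stand-in: the penalty form (kappa over the square root of visit counts), the uniform prior over unseen transitions, and all constants are assumptions for exposition, not the adaptive reward-shaping scheme that A-DAC actually uses.

```python
import numpy as np


def build_pessimistic_mdp(transitions, n_states, n_actions, kappa=1.0):
    """Estimate a tabular MDP from batch data (list of (s, a, r, s_next) tuples)
    and subtract a count-based uncertainty penalty from the empirical rewards.
    Illustrative sketch only; kappa and the penalty form are assumptions."""
    counts = np.zeros((n_states, n_actions, n_states))
    reward_sums = np.zeros((n_states, n_actions))
    for s, a, r, s_next in transitions:
        counts[s, a, s_next] += 1
        reward_sums[s, a] += r

    sa_counts = counts.sum(axis=2)                              # visits per (s, a)
    P = np.ones((n_states, n_actions, n_states)) / n_states     # uniform prior for unseen pairs
    R = np.zeros((n_states, n_actions))
    seen = sa_counts > 0
    P[seen] = counts[seen] / sa_counts[seen][:, None]           # empirical transition model
    R[seen] = reward_sums[seen] / sa_counts[seen]               # empirical mean reward

    # Pessimism: penalize rarely visited state-action pairs so the planner
    # avoids regions the batch data does not support.
    penalty = kappa / np.sqrt(np.maximum(sa_counts, 1))
    R_pess = R - penalty
    R_pess[~seen] -= kappa                                      # strongly discourage unseen pairs
    return P, R_pess


def value_iteration(P, R, gamma=0.99, tol=1e-6):
    """Plan greedily in the estimated pessimistic MDP."""
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    while True:
        Q = R + gamma * P @ V                                   # shape (n_states, n_actions)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return Q.argmax(axis=1), V_new                      # greedy policy and values
        V = V_new
```

The fixed count-based penalty above is the simplest way to encode data uncertainty; the adaptive shaping described in the abstract replaces this hand-tuned term with one derived from the data itself.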