The goal of traffic signal control is to coordinate multiple traffic signals to improve the traffic efficiency of a district or a city. In this work, we propose a novel Meta Variationally Intrinsic Motivated (MetaVIM) RL method that learns a decentralized policy for each traffic signal conditioned only on its local observation. MetaVIM makes three novel contributions. First, to make the model applicable to new, unseen target scenarios, we formulate traffic signal control as a meta-learning problem over a set of related tasks. The training scenario is divided into multiple partially observable Markov decision process (POMDP) tasks, each corresponding to one traffic light; in each task, the neighbours are regarded as an unobserved part of the state. Second, we assume that the reward, transition, and policy functions vary across tasks but share a common structure. For each task, a learned latent variable conditioned on the past trajectory represents the task-specific information in these functions; it is then fed into the policy to automatically trade off exploration and exploitation and induce the RL agent to choose reasonable actions. Third, to stabilize policy learning, four decoders are introduced to predict the current agent's received observations and rewards with and without the neighbour agents' policies, and a novel intrinsic reward is designed to encourage the received observation and reward to be invariant to the neighbour agents. Empirically, extensive experiments conducted on CityFlow demonstrate that the proposed method substantially outperforms existing methods and shows superior generalizability.
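To make the described components concrete, the following is a minimal PyTorch-style sketch (not the authors' released code) of a trajectory-conditioned latent encoder, a decentralized policy that takes only the local observation plus the latent, and an illustrative intrinsic reward that penalizes discrepancies between decoder predictions made with and without the neighbours' policies. All module names, dimensions, and the exact form of the intrinsic reward are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class LatentEncoder(nn.Module):
    """Infers a task-specific latent variable z from the agent's past
    trajectory of (observation, action, reward) tuples (hypothetical layout)."""
    def __init__(self, obs_dim, act_dim, latent_dim, hidden_dim=64):
        super().__init__()
        self.rnn = nn.GRU(obs_dim + act_dim + 1, hidden_dim, batch_first=True)
        self.mu = nn.Linear(hidden_dim, latent_dim)      # mean of q(z | trajectory)
        self.logvar = nn.Linear(hidden_dim, latent_dim)  # log-variance of q(z | trajectory)

    def forward(self, trajectory):
        # trajectory: (batch, T, obs_dim + act_dim + 1)
        _, h = self.rnn(trajectory)
        h = h.squeeze(0)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterised sample
        return z, mu, logvar

class DecentralizedPolicy(nn.Module):
    """Policy conditioned only on the local observation and the latent z."""
    def __init__(self, obs_dim, latent_dim, n_actions, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + latent_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, n_actions),
        )

    def forward(self, obs, z):
        logits = self.net(torch.cat([obs, z], dim=-1))
        return torch.distributions.Categorical(logits=logits)

def intrinsic_reward(obs_pred_with, obs_pred_without, rew_pred_with, rew_pred_without):
    """Illustrative intrinsic reward: the smaller the gap between the decoders'
    predictions with and without the neighbours' policies, the larger the bonus,
    encouraging the received observation and reward to be invariant to neighbours."""
    obs_gap = (obs_pred_with - obs_pred_without).pow(2).mean(dim=-1)
    rew_gap = (rew_pred_with - rew_pred_without).pow(2).mean(dim=-1)
    return -(obs_gap + rew_gap)
```

In the full method, four such decoders (observation and reward, each with and without neighbour policies) would supply the prediction pairs passed to `intrinsic_reward`; here they are left abstract.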