The recent proliferation of research on multi-agent deep reinforcement learning (MDRL) offers an encouraging way to coordinate multiple connected and automated vehicles (CAVs) passing through an intersection. In this paper, we apply a value-decomposition-based MDRL approach (QMIX) to control CAVs in mixed-autonomy traffic of different densities so that they pass a non-signalized intersection efficiently and safely with reasonable fuel consumption. Implementation tricks, including network-level improvements, Q-value updates via TD($\lambda$), and a reward-clipping operation, are added to the pure QMIX framework and are expected to improve the convergence speed and asymptotic performance of the original version. The efficacy of our approach is demonstrated on several evaluation metrics: average speed, number of collisions, and average fuel consumption per episode. The experimental results show that the convergence speed and asymptotic performance of our approach exceed those of the original QMIX and of proximal policy optimization (PPO), a state-of-the-art reinforcement learning baseline applied to the non-signalized intersection. Moreover, under the lower traffic flow, CAVs controlled by our method improve their average speed without collisions and consume the least fuel. Training is additionally conducted under doubled traffic density, where the learning reward still converges. The model with maximal reward and minimal crashes still guarantees low fuel consumption, but it slightly reduces vehicle efficiency and induces more collisions than its lower-traffic counterpart, highlighting the difficulty of generalizing an RL policy to more demanding scenarios.
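Two of the implementation tricks named above, TD($\lambda$) target computation and reward clipping, can be illustrated in isolation. The sketch below is a minimal, generic version of both (not the paper's actual training code): `td_lambda_targets` computes $\lambda$-returns for an episode via the standard backward recursion $G_t = r_t + \gamma\,[(1-\lambda)\,\max_a Q(s_{t+1},a) + \lambda\, G_{t+1}]$, and `clip_rewards` bounds per-step rewards. The function names, arguments, and clipping range are illustrative assumptions.

```python
import numpy as np

def clip_rewards(rewards, low=-1.0, high=1.0):
    """Reward clipping: bound per-step rewards to stabilize value targets.
    The [-1, 1] range is a common default, not the paper's stated choice."""
    return np.clip(rewards, low, high)

def td_lambda_targets(rewards, bootstrap_q, gamma=0.99, lam=0.8):
    """Compute TD(lambda) targets for one episode, working backwards.

    rewards:      per-step rewards r_0 .. r_{T-1}
    bootstrap_q:  bootstrap values max_a Q(s_{t+1}, a) for t = 0 .. T-1
                  (the last entry is 0 when s_T is terminal)

    Uses the recursion G_t = r_t + gamma * ((1-lam) * Q_{t+1} + lam * G_{t+1}).
    lam = 0 recovers one-step TD targets; lam = 1 recovers Monte Carlo returns.
    """
    T = len(rewards)
    targets = np.zeros(T)
    g_next = bootstrap_q[-1]  # value used to bootstrap beyond the last step
    for t in reversed(range(T)):
        g_next = rewards[t] + gamma * ((1.0 - lam) * bootstrap_q[t] + lam * g_next)
        targets[t] = g_next
    return targets
```

Interpolating between one-step TD and Monte Carlo returns in this way is what the abstract credits with faster convergence relative to the one-step targets of pure QMIX.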