Learning the policy for mixed electric platoon control of automated and human-driven vehicles at signalized intersection: a random search approach

The upgrading and updating of vehicles have accelerated in the past decades. Driven by the need for environmental friendliness and intelligence, electric vehicles (EVs) and connected and automated vehicles (CAVs) have become new components of transportation systems. This paper develops a reinforcement learning framework to implement adaptive control for an electric platoon composed of CAVs and human-driven vehicles (HDVs) at a signalized intersection. First, a Markov Decision Process (MDP) model is proposed to describe the decision process of the mixed platoon. A novel state representation and reward function are designed for the model to account for the behavior of the whole platoon. Second, to deal with the delayed reward, an Augmented Random Search (ARS) algorithm is proposed. The control policy learned by the agent guides the longitudinal motion of the CAV, which serves as the leader of the platoon. Finally, a series of simulations is carried out in the SUMO simulation suite. Compared with several state-of-the-art (SOTA) reinforcement learning approaches, the proposed method obtains a higher reward. The simulation results also demonstrate the effectiveness of the delayed reward, which is designed to outperform the distributed reward mechanism. Compared with normal car-following behavior, the sensitivity analysis reveals that energy can be saved to different extents (39.27% to 82.51%) by adjusting the relative importance of the optimization goals. Provided that travel delay is not sacrificed, the proposed control method can save up to 53.64% of electric energy.
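The abstract does not spell out the ARS variant or the platoon MDP, so the following is only a minimal sketch of random search in its commonly described basic form: a linear policy, symmetric parameter perturbations, selection of the top-b directions, and an update scaled by the standard deviation of the collected rewards. The environment is a hypothetical toy task (a single point-mass leader that should arrive at a stop line at rest), not the paper's SUMO setup; the names `rollout` and `ars_train` and all numeric constants are assumptions made for illustration. The reward is returned only at the end of the episode, which illustrates why an episode-level search method is a natural fit for a delayed reward.

```python
import random
import statistics


def rollout(theta, horizon=40, dt=0.5):
    """Run one toy episode and return a single delayed (end-of-episode) reward."""
    gap, speed, energy = 50.0, 10.0, 0.0  # hypothetical initial gap (m) and speed (m/s)
    for _ in range(horizon):
        # Linear policy on scaled features: acceleration = theta . [gap/50, speed/10]
        accel = theta[0] * gap / 50.0 + theta[1] * speed / 10.0
        accel = max(-3.0, min(3.0, accel))    # assumed actuator limits (m/s^2)
        speed = max(0.0, speed + accel * dt)  # the vehicle cannot reverse
        gap -= speed * dt
        energy += accel * accel * dt          # crude stand-in for energy use
    # The reward arrives only here: distance from the stop line, residual speed, energy.
    return -(abs(gap) + speed) - 0.01 * energy


def ars_train(iterations=100, n_dirs=8, top_b=4, step=0.05, noise=0.3, seed=0):
    """Basic random search: symmetric perturbations, top-b selection, std-scaled update."""
    rng = random.Random(seed)
    theta = [0.0, 0.0]
    for _ in range(iterations):
        trials = []
        for _ in range(n_dirs):
            delta = [rng.gauss(0.0, 1.0) for _ in theta]
            r_plus = rollout([t + noise * d for t, d in zip(theta, delta)])
            r_minus = rollout([t - noise * d for t, d in zip(theta, delta)])
            trials.append((r_plus, r_minus, delta))
        # Keep the b directions whose better rollout scored highest.
        trials.sort(key=lambda tr: max(tr[0], tr[1]), reverse=True)
        best = trials[:top_b]
        # Scale the step by the std of the rewards actually used in the update.
        sigma = statistics.pstdev([r for rp, rm, _ in best for r in (rp, rm)])
        sigma = max(sigma, 1e-8)  # guard against division by zero
        for i in range(len(theta)):
            grad_i = sum((rp - rm) * d[i] for rp, rm, d in best)
            theta[i] += step / (top_b * sigma) * grad_i
    return theta


theta = ars_train()
print("untrained reward:", rollout([0.0, 0.0]))
print("trained reward:  ", rollout(theta))
```

Because each candidate policy is scored by the total return of a whole episode, no per-step credit assignment is needed. In the paper's setting the rollout would instead be a SUMO simulation of the mixed platoon, with the state and reward defined as the authors describe.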
