With the adoption of autonomous vehicles on our roads, we will witness a mixed-autonomy environment where autonomous and human-driven vehicles must learn to coexist by sharing the same road infrastructure. To attain socially desirable behaviors, autonomous vehicles must be instructed to consider the utility of other vehicles around them in their decision-making process. In particular, we study the maneuver planning problem for autonomous vehicles and investigate how a decentralized reward structure can induce altruism in their behavior and incentivize them to account for the interests of other autonomous and human-driven vehicles. This is a challenging problem due to the ambiguity of a human driver's willingness to cooperate with an autonomous vehicle. Thus, in contrast to existing works that rely on behavior models of human drivers, we take an end-to-end approach and let the autonomous agents implicitly learn the decision-making process of human drivers from experience alone. We introduce a multi-agent variant of the synchronous Advantage Actor-Critic (A2C) algorithm and train agents that coordinate with each other and can affect the behavior of human drivers to improve traffic flow and safety.
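The abstract does not spell out the decentralized reward itself; as a rough illustration only, one common way to induce altruism in this line of work is a social-value-orientation-style blend of an agent's own utility with the utilities of surrounding vehicles. The sketch below is a hypothetical example under that assumption: the function name `altruistic_reward`, the angle parameter `svo_angle`, and the uniform averaging over nearby vehicles are illustrative choices, not the authors' exact formulation.

```python
import numpy as np

def altruistic_reward(ego_reward, other_rewards, svo_angle=np.pi / 6):
    """Blend an agent's own utility with the mean utility of the
    vehicles around it (autonomous and human-driven alike).

    svo_angle = 0 recovers a purely egoistic agent; angles closer to
    pi/2 weight the social term more heavily. The default angle and
    the uniform averaging are assumptions made for illustration.
    """
    social_term = np.mean(other_rewards) if len(other_rewards) > 0 else 0.0
    return np.cos(svo_angle) * ego_reward + np.sin(svo_angle) * social_term
```

In a multi-agent A2C training loop, each autonomous agent would receive this blended scalar in place of its raw ego reward; the utilities of human-driven vehicles enter only through `other_rewards`, consistent with learning their decision-making implicitly from experience rather than from an explicit behavior model.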