Key challenges for applying deep reinforcement learning to real systems are adaptivity to changing environments and efficiency with respect to computational resources and data. When learning lane-change behavior for autonomous driving, agents have to deal with a varying number of surrounding vehicles. Furthermore, the number of required transitions imposes a bottleneck, since test drivers cannot perform an arbitrary number of lane changes in the real world. In the off-policy setting, additional information on solving the task can be gained by observing the actions of other drivers. While in the classical RL setup this knowledge remains unused, we use other drivers as surrogates to learn the agent's value function more efficiently. We propose Surrogate Q-learning, which addresses the aforementioned problems and drastically reduces the required driving time. We further propose an efficient implementation based on a permutation-equivariant deep neural network architecture for the Q-function that estimates action-values for a variable number of vehicles in sensor range. We show that this architecture leads to a novel replay sampling technique we call Scene-centric Experience Replay, and we evaluate the performance of Surrogate Q-learning and Scene-centric Experience Replay in the open-source traffic simulator SUMO. Additionally, we show that our methods enhance the real-world applicability of RL systems by learning policies on the real-world highD dataset.
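To illustrate the idea of a permutation-equivariant Q-function over a variable number of surrounding vehicles, the following is a minimal DeepSets-style sketch, not the architecture described in the paper; the module name, feature dimension, and the assumption of three lane-change actions (keep lane, change left, change right) are illustrative choices.

```python
import torch
import torch.nn as nn


class PermutationEquivariantQNet(nn.Module):
    """Illustrative sketch: per-vehicle action-values from a variable-size set of
    surrounding-vehicle features (DeepSets-style; not the paper's exact design)."""

    def __init__(self, feat_dim: int, hidden_dim: int = 64, n_actions: int = 3):
        super().__init__()
        # per-vehicle encoder, shared across all vehicles in the scene
        self.phi = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        # head maps [per-vehicle encoding, pooled scene context] -> Q-values
        self.head = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, n_actions),
        )

    def forward(self, vehicles: torch.Tensor) -> torch.Tensor:
        # vehicles: (batch, n_vehicles, feat_dim); n_vehicles may vary between scenes
        h = self.phi(vehicles)                            # (B, N, H) per-vehicle encodings
        context = h.mean(dim=1, keepdim=True)             # permutation-invariant pooling
        context = context.expand_as(h)                    # broadcast scene context to each vehicle
        return self.head(torch.cat([h, context], dim=-1))  # (B, N, n_actions)


if __name__ == "__main__":
    net = PermutationEquivariantQNet(feat_dim=8)
    scene = torch.randn(1, 5, 8)   # one scene with 5 vehicles in sensor range
    print(net(scene).shape)        # torch.Size([1, 5, 3])
```

Because the encoder is shared across vehicles and the pooling is order-invariant, permuting the vehicles permutes the per-vehicle Q-value rows accordingly; such a per-vehicle output is also what would allow a single observed scene to yield action-value estimates for several drivers at once, as in Scene-centric Experience Replay.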