Cooperative trajectory planning methods for automated vehicles can solve traffic scenarios that require a high degree of cooperation between traffic participants. However, for cooperative systems to integrate into human-centered traffic, automated systems must behave in a human-like manner so that humans can anticipate their decisions. While Reinforcement Learning has made remarkable progress on the decision-making side, parameterizing a reward model that yields predictable actions remains non-trivial. This work employs feature-based Maximum Entropy Inverse Reinforcement Learning combined with Monte Carlo Tree Search to learn reward models that maximize the likelihood of recorded multi-agent cooperative expert trajectories. The evaluation demonstrates that the approach recovers a reasonable reward model that mimics the expert and performs similarly to a manually tuned baseline reward model.
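To make the learning principle concrete, below is a minimal sketch of one feature-based Maximum Entropy IRL update under a linear reward assumption r(s) = θᵀφ(s). The function name `maxent_irl_step`, the learning rate, and the toy feature vectors are illustrative assumptions, not the paper's implementation; in the full approach, the learner's feature expectations would be estimated from trajectories planned by Monte Carlo Tree Search under the current reward weights.

```python
import numpy as np

def maxent_irl_step(theta, mu_expert, mu_learner, lr=0.1):
    """One gradient-ascent step on the MaxEnt IRL log-likelihood.

    theta:      current linear reward weights, shape (d,)
    mu_expert:  mean feature counts of the recorded cooperative
                expert trajectories, shape (d,)
    mu_learner: mean feature counts of trajectories sampled from
                the current planner (assumed here to come from MCTS
                rollouts under r(s) = theta @ phi(s)), shape (d,)
    """
    # Gradient of the expert-trajectory log-likelihood under the
    # maximum entropy model: expert feature expectations minus the
    # learner's feature expectations.
    grad = mu_expert - mu_learner
    return theta + lr * grad

# Toy usage with hypothetical 4-dimensional features
# (e.g. progress, acceleration, lane deviation, proximity).
theta = np.zeros(4)
mu_expert = np.array([0.8, 0.1, 0.3, 0.2])
mu_learner = np.array([0.4, 0.4, 0.2, 0.5])
theta = maxent_irl_step(theta, mu_expert, mu_learner)
print(theta)  # weights shift toward features the expert exhibits more
```

Iterating this update shrinks the gap between expert and planner feature expectations, which is precisely the condition under which the learned reward maximizes the likelihood of the expert demonstrations.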