Learning Reward Models for Cooperative Trajectory Planning with Inverse Reinforcement Learning and Monte Carlo Tree Search

Cooperative trajectory planning methods for automated vehicles can solve traffic scenarios that require a high degree of cooperation between traffic participants. For cooperative systems to integrate into human-centered traffic, the automated systems must behave in a human-like manner, so that humans can anticipate their decisions. While Reinforcement Learning has made remarkable progress on the decision-making part of the problem, it is non-trivial to parameterize a reward model that yields predictable actions. This work employs feature-based Maximum Entropy Inverse Reinforcement Learning in combination with Monte Carlo Tree Search to learn a reward model that maximizes the likelihood of recorded multi-agent cooperative expert trajectories. The evaluation demonstrates that the approach can recover a reasonable reward model that mimics the expert and performs similarly to a manually tuned baseline reward model.
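To make the learning objective concrete, the following is a minimal sketch of feature-based Maximum Entropy Inverse Reinforcement Learning with a linear reward, not the implementation used in this work. It assumes the reward of a trajectory is the dot product of a weight vector `theta` with the trajectory's feature vector, and that a fixed list of candidate trajectories (`sampled`, a hypothetical stand-in for trajectories that a tree search might produce) approximates the trajectory distribution. The feature values, learning rate, and all function names are illustrative assumptions.

```python
import math


def trajectory_probs(theta, features):
    """Maximum-entropy distribution: P(tau) is proportional to exp(theta . f(tau))."""
    rewards = [sum(t * f for t, f in zip(theta, feats)) for feats in features]
    m = max(rewards)  # subtract the maximum for numerical stability
    weights = [math.exp(r - m) for r in rewards]
    z = sum(weights)
    return [w / z for w in weights]


def expected_features(theta, features):
    """Feature expectation under the current reward model."""
    probs = trajectory_probs(theta, features)
    return [
        sum(p * feats[k] for p, feats in zip(probs, features))
        for k in range(len(theta))
    ]


def maxent_irl(expert_features, sampled_features, lr=0.1, steps=2000):
    """Gradient ascent on the log-likelihood of the expert demonstrations.

    The gradient is the expert feature expectation minus the model's
    feature expectation over the sampled trajectories.
    """
    theta = [0.0] * len(expert_features)
    for _ in range(steps):
        model = expected_features(theta, sampled_features)
        theta = [t + lr * (e - m) for t, e, m in zip(theta, expert_features, model)]
    return theta


# Hypothetical toy data: two features per trajectory, three candidate trajectories.
sampled = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
expert = [0.8, 0.2]  # assumed mean feature vector of the expert trajectories

theta = maxent_irl(expert, sampled)
print(theta, expected_features(theta, sampled))
```

After training, the feature expectation under the learned weights should approximately match the expert's, which is the condition under which the expert trajectories are most likely under a maximum-entropy model of this form.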
