Driving Behavior Modeling using Naturalistic Human Driving Data with Inverse Reinforcement Learning

Driving behavior modeling is essential for designing safe, intelligent, and personalized autonomous driving systems. This paper uses a driving model built on an internal reward function that emulates the human decision-making mechanism. To infer the reward function parameters from naturalistic human driving data, we propose a structural assumption about human driving behavior that focuses on discrete latent driving intentions. This assumption converts the continuous behavior modeling problem into a discrete one, which makes maximum entropy inverse reinforcement learning (IRL) tractable for learning reward functions. Specifically, a polynomial trajectory sampler generates candidate trajectories that reflect high-level intentions, and these candidates are used to approximate the partition function in the maximum entropy IRL framework. An environment model that accounts for interactions between the ego vehicle and surrounding vehicles is built to better estimate the generated trajectories. The proposed method is applied to learn personalized reward functions for individual human drivers from the NGSIM highway driving dataset. Qualitatively, the learned reward functions explicitly express the preferences of different drivers and help interpret their decisions. Quantitatively, the learned reward functions are robust: proximity to the human driving trajectories declines only marginally when the reward function is applied under testing conditions. In testing, the personalized modeling method outperforms the general modeling approach, significantly reducing the modeling error in human likeness (a custom metric to gauge accuracy), and both methods deliver better results than the other baseline methods.
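To illustrate how a finite set of sampled candidate trajectories can stand in for the partition function in maximum entropy IRL, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes a reward that is linear in trajectory features, and the three feature dimensions, the candidate values, the learning rate, and the function names trajectory_probs and fit_reward are all hypothetical choices made for illustration.

```python
import math

def trajectory_probs(theta, features):
    """Max-ent distribution over candidates: P(i) proportional to exp(theta . f_i)."""
    scores = [sum(t * f for t, f in zip(theta, fv)) for fv in features]
    m = max(scores)  # subtract the max score for numerical stability
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)  # finite-sample stand-in for the partition function
    return [w / z for w in weights]

def fit_reward(features, human_features, lr=0.1, steps=500):
    """Gradient ascent on the log-likelihood of the human trajectory.

    The gradient is the human feature vector minus the expected feature
    vector under the current max-ent distribution over the candidates.
    """
    dim = len(human_features)
    theta = [0.0] * dim
    for _ in range(steps):
        probs = trajectory_probs(theta, features)
        expected = [sum(p * fv[k] for p, fv in zip(probs, features))
                    for k in range(dim)]
        theta = [theta[k] + lr * (human_features[k] - expected[k])
                 for k in range(dim)]
    return theta

# Hypothetical features per candidate: [speed, jerk magnitude, lane change flag].
candidates = [
    [1.0, 0.2, 0.0],  # keep lane, moderate speed
    [0.8, 0.1, 0.0],  # keep lane, slower and smoother
    [1.2, 0.6, 1.0],  # change lane, faster and rougher
]
human = [0.8, 0.1, 0.0]  # assumed to match the second candidate

theta = fit_reward(candidates, human)
probs = trajectory_probs(theta, candidates)
print("learned weights:", theta)
print("candidate probabilities:", probs)
```

With these made-up numbers, the learned weights give the candidate that matches the human features the highest probability. In a real setting the candidate features would come from the trajectory sampler and the environment model rather than being typed in by hand.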
