Driving behavior modeling is of great importance for designing safe, smart, and personalized autonomous driving systems. In this paper, an internal reward function-based driving model that emulates the human driver's internal decision-making mechanism is adopted. To infer the reward function from naturalistic human driving data, we propose a structural assumption about human driving behavior that focuses on discrete latent driving intentions. It converts the continuous behavior modeling problem into a discrete setting and thus makes maximum entropy inverse reinforcement learning (IRL) tractable for learning reward functions. Specifically, a polynomial trajectory sampler is adopted to generate candidate trajectories that reflect high-level intentions and to approximate the partition function in the maximum entropy IRL framework, and an environment model that captures interactive behaviors between the ego vehicle and surrounding vehicles is built to better evaluate the generated trajectories. The proposed method is applied to learn personalized reward functions for individual human drivers from the NGSIM highway dataset. The qualitative results demonstrate that the learned reward function can explicitly express the preferences of different drivers and interpret their decisions. The quantitative results reveal that the learned reward function is robust: proximity to the human driving trajectories declines only marginally when the reward function is applied under the testing conditions. In the testing conditions, the personalized modeling method outperforms the general modeling approach, reducing the modeling errors in human likeness (a custom metric to gauge accuracy) by 23%, and both methods deliver better results than the other baseline methods.
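To illustrate the sampling-based approximation described above, the following is a minimal sketch of one maximum entropy IRL update, assuming a linear reward over hand-crafted trajectory features. The feature dimensions, the candidate set standing in for the polynomial sampler's output, and the demonstrated trajectory are hypothetical placeholders, not the paper's actual implementation.

```python
import numpy as np

def maxent_irl_step(theta, demo_features, candidate_features, lr=0.01):
    """One gradient step of maximum entropy IRL with a sampled trajectory set.

    theta              : (d,) current reward weights (linear reward r = theta @ f)
    demo_features      : (d,) feature vector of the demonstrated (human) trajectory
    candidate_features : (N, d) features of candidate trajectories from the sampler;
                         they stand in for the continuous trajectory space when
                         approximating the partition function Z = sum_i exp(theta @ f_i).
    """
    rewards = candidate_features @ theta               # reward of each candidate
    rewards -= rewards.max()                           # stabilize the softmax
    probs = np.exp(rewards) / np.exp(rewards).sum()    # P(tau_i) proportional to exp(r(tau_i))
    expected_features = probs @ candidate_features     # expected features under current reward
    grad = demo_features - expected_features           # maximum entropy IRL gradient
    return theta + lr * grad

# Hypothetical usage: 4 sampled candidates described by 3 features
rng = np.random.default_rng(0)
candidates = rng.normal(size=(4, 3))   # e.g. [speed error, jerk, distance to lead vehicle]
demo = candidates[0] + 0.1             # assume the human trajectory lies near candidate 0
theta = np.zeros(3)
for _ in range(100):
    theta = maxent_irl_step(theta, demo, candidates)
print(theta)                           # weights encoding the driver's preferences
```

In this sketch the candidate set plays the role of the polynomial trajectory sampler's output: rather than integrating over all continuous trajectories, the partition function is approximated by a softmax over the sampled candidates, which is what makes the maximum entropy formulation tractable.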