The reward function, as an incentive representation that recognizes humans' agency and rationalizes their actions, is particularly appealing for modeling human behavior in human-robot interaction. Inverse Reinforcement Learning (IRL) is an effective way to recover reward functions from demonstrations. However, applying it to multi-agent settings remains challenging, since the mutual influence between agents must be appropriately modeled. To tackle this challenge, previous work either exploits equilibrium solution concepts, assuming that humans are perfectly rational optimizers with unbounded intelligence, or pre-assigns humans' interaction strategies a priori. In this work, we advocate that humans are boundedly rational and reason about others' decision-making with different intelligence levels, and that such an inherent and latent characteristic should be accounted for in reward learning algorithms. Hence, we exploit this insight from Theory of Mind and propose a new multi-agent Inverse Reinforcement Learning framework that reasons about humans' latent intelligence levels during learning. We validate our approach in both zero-sum and general-sum games with synthetic agents, and illustrate a practical application to learning human drivers' reward functions from real driving data. We compare our approach with two baseline algorithms. The results show that, by reasoning about humans' latent intelligence levels, the proposed approach has more flexibility and capability to retrieve reward functions that better explain humans' driving behaviors.