To understand how people interact with each other in collaborative settings, especially in situations where individuals know little about their teammates, Multiagent Inverse Reinforcement Learning (MIRL) aims to infer the reward functions guiding the behavior of each individual given trajectories of a team's behavior during task performance. Unlike current MIRL approaches, team members \emph{are not} assumed to know each other's goals a priori; rather, they collaborate by adapting to the goals of others, perceived by observing their behavior, all while jointly performing the task. To address this problem, we propose a novel approach to MIRL via Theory of Mind (MIRL-ToM). For each agent, we first use ToM reasoning to estimate a posterior distribution over baseline reward profiles given their demonstrated behavior. We then perform MIRL via decentralized equilibrium by employing single-agent Maximum Entropy IRL to infer a reward function for each agent, where we simulate the behavior of the other teammates according to the time-varying distribution over profiles. We evaluate our approach in a simulated 2-player search-and-rescue operation where the goal of the agents, playing different roles, is to search for and evacuate victims in the environment. Results show that the choice of baseline profiles is paramount to the recovery of the ground-truth rewards, and that MIRL-ToM is able to recover the rewards used by agents interacting with both known and unknown teammates.
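To make the two-stage procedure described above concrete, the following is a minimal, self-contained Python sketch of the pipeline, assuming a discrete state and action space and linear reward features; all function names (e.g., \texttt{update\_profile\_posterior}, \texttt{simulate\_teammate\_action}) and the toy profiles are illustrative assumptions, not the authors' implementation.

\begin{verbatim}
import numpy as np

def update_profile_posterior(prior, profile_policies, state, action):
    # Stage 1 (ToM reasoning): Bayesian update of P(profile | history),
    # proportional to P(action | state, profile) * prior, where each
    # candidate baseline profile induces a policy via planning under
    # its reward function.
    likelihoods = np.array([pi(state)[action] for pi in profile_policies])
    posterior = prior * likelihoods
    return posterior / posterior.sum()

def simulate_teammate_action(rng, profile_policies, posterior, state):
    # Stage 2 helper: simulate a teammate by drawing a profile from the
    # current (time-varying) posterior and acting with its policy.
    k = rng.choice(len(profile_policies), p=posterior)
    action_probs = profile_policies[k](state)
    return rng.choice(len(action_probs), p=action_probs)

def maxent_irl_update(theta, empirical_features, expected_features, lr=0.1):
    # Stage 2 (decentralized MIRL): single-agent MaxEnt IRL gradient step
    # for the learner's reward r(s, a) = theta . phi(s, a), matching
    # empirical feature counts from the demonstrations against expected
    # counts under the soft-optimal policy, with teammates simulated
    # as above.
    return theta + lr * (empirical_features - expected_features)

# Toy usage: two candidate profiles over 3 actions in a single state.
rng = np.random.default_rng(0)
profiles = [lambda s: np.array([0.7, 0.2, 0.1]),
            lambda s: np.array([0.1, 0.2, 0.7])]
belief = np.array([0.5, 0.5])
belief = update_profile_posterior(belief, profiles, state=0, action=0)
teammate_action = simulate_teammate_action(rng, profiles, belief, state=0)
\end{verbatim}

In this sketch, the posterior over profiles plays the role of the ToM-based model of each teammate, and the MaxEnt IRL step is run independently per agent, which mirrors the decentralized-equilibrium formulation in the abstract.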