The current landscape of multi-agent expert imitation is broadly dominated by two families of algorithms: Behavioral Cloning (BC) and Adversarial Imitation Learning (AIL). BC approaches suffer from compounding errors, as they ignore the sequential decision-making nature of the trajectory generation problem, and they cannot effectively model multi-modal behaviors. While AIL methods address compounding errors and can represent multi-modal policies, they are plagued by unstable training dynamics. In this work, we address this issue by introducing a novel self-supervised loss that encourages the discriminator to approximate a richer reward function. We employ our method to train a graph-based multi-agent actor-critic architecture that learns a centralized policy, conditioned on a learned latent interaction graph. We show that our method (SS-MAIL) outperforms prior state-of-the-art methods on real-world prediction tasks, as well as on custom-designed synthetic experiments. We prove that SS-MAIL belongs to the family of AIL methods by providing a theoretical connection to cost-regularized apprenticeship learning. Moreover, we leverage the self-supervised formulation to introduce a novel teacher-forcing-based curriculum (Trajectory Forcing) that improves sample efficiency by progressively increasing the length of the generated trajectory. The SS-MAIL framework improves multi-agent imitation by stabilizing policy training, improving reward shaping, and enabling the modeling of multi-modal trajectories.
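To make the Trajectory Forcing idea concrete, the sketch below shows one way a teacher-forcing curriculum that progressively lengthens the generated trajectory segment could be scheduled. This is a minimal illustration only: the class name, the linear schedule, and the parameters (max_horizon, warmup_epochs, total_epochs) are assumptions for exposition, not details taken from the paper.

```python
# Minimal sketch of a Trajectory Forcing-style curriculum (illustrative only).
# Assumes a linear schedule: fully teacher-forced warm-up, then the
# self-generated rollout segment grows until it covers the full horizon.

class TrajectoryForcingSchedule:
    """Returns how many steps the policy rolls out on its own predictions
    before ground-truth (teacher-forced) states are fed back in."""

    def __init__(self, max_horizon: int, warmup_epochs: int, total_epochs: int):
        self.max_horizon = max_horizon
        self.warmup_epochs = warmup_epochs
        self.total_epochs = total_epochs

    def rollout_length(self, epoch: int) -> int:
        # During warm-up, generate only one step before teacher forcing resumes.
        if epoch < self.warmup_epochs:
            return 1
        # Afterwards, grow the generated segment linearly with training progress.
        progress = (epoch - self.warmup_epochs) / max(
            1, self.total_epochs - self.warmup_epochs
        )
        return min(self.max_horizon, max(1, round(progress * self.max_horizon)))


if __name__ == "__main__":
    schedule = TrajectoryForcingSchedule(max_horizon=50, warmup_epochs=5, total_epochs=100)
    for epoch in (0, 5, 30, 60, 100):
        print(f"epoch {epoch}: generate {schedule.rollout_length(epoch)} steps")
```

In practice, the returned rollout length would determine how many consecutive steps of each training trajectory are produced by the learned policy rather than copied from the expert demonstration; the exact schedule shape used by SS-MAIL is not specified here.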