We consider model-based multi-agent reinforcement learning, where the environment transition model is unknown and can only be learned via expensive interactions with the environment. We propose H-MARL (Hallucinated Multi-Agent Reinforcement Learning), a novel sample-efficient algorithm that efficiently balances exploration, i.e., learning about the environment, and exploitation, i.e., achieving good equilibrium performance in the underlying general-sum Markov game. H-MARL builds high-probability confidence intervals around the unknown transition model and sequentially updates them based on newly observed data. Using these, it constructs an optimistic hallucinated game for the agents, for which equilibrium policies are computed at each round. We consider general statistical models (e.g., Gaussian processes, deep ensembles) and policy classes (e.g., deep neural networks), and theoretically analyze our approach by bounding the agents' dynamic regret. Moreover, we provide a convergence rate to the equilibria of the underlying Markov game. We demonstrate our approach experimentally on an autonomous driving simulation benchmark. H-MARL learns successful equilibrium policies after a few interactions with the environment and significantly improves performance compared to non-optimistic exploration methods.
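The following is a minimal sketch of the episodic loop described above: maintain a dynamics model with calibrated uncertainty, plan equilibrium policies in the optimistic hallucinated game induced by its confidence intervals, roll those policies out in the true environment, and refit the model on the new data. All names (`ConfidenceDynamicsModel`, `hallucinated_step`, `solve_equilibrium`, the `env` interface) are hypothetical placeholders for illustration, not the paper's implementation.

```python
import numpy as np

class ConfidenceDynamicsModel:
    """Statistical model of the transition dynamics with calibrated
    uncertainty, e.g., a Gaussian process or a deep ensemble (placeholder)."""

    def __init__(self, state_dim, action_dim):
        self.data = []  # observed (state, joint_action, next_state) tuples

    def update(self, transitions):
        # Refit the model on all data observed so far.
        self.data.extend(transitions)

    def mean_and_beta_sigma(self, state, joint_action):
        # Predicted mean next state and scaled confidence width (beta * sigma).
        # Placeholder values; a real model returns its posterior statistics.
        return np.zeros_like(state), np.ones_like(state)


def hallucinated_step(model, state, joint_action, eta):
    """Optimistic ('hallucinated') dynamics: the auxiliary variable eta
    in [-1, 1]^d selects a next state inside the confidence region
    around the model's prediction."""
    mu, width = model.mean_and_beta_sigma(state, joint_action)
    return mu + eta * width


def h_marl(env, num_episodes, episode_len, solve_equilibrium):
    """H-MARL outer loop (sketch): plan in the hallucinated game,
    act in the real environment, update the confidence intervals."""
    model = ConfidenceDynamicsModel(env.state_dim, env.action_dim)
    policies = None
    for _ in range(num_episodes):
        # 1) Equilibrium policies of the current optimistic hallucinated game.
        policies = solve_equilibrium(model, hallucinated_step)

        # 2) Roll the policies out in the true environment.
        transitions, state = [], env.reset()
        for _ in range(episode_len):
            joint_action = [pi(state) for pi in policies]
            next_state = env.step(joint_action)
            transitions.append((state, joint_action, next_state))
            state = next_state

        # 3) Update the model (and hence the confidence intervals).
        model.update(transitions)
    return policies
```

The key design choice this sketch highlights is that optimism enters only through `hallucinated_step`: any equilibrium solver and any statistical model with confidence intervals can be plugged in unchanged.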