Training a multi-agent reinforcement learning (MARL) model with a sparse reward is generally difficult because numerous combinations of interactions among agents can induce a given outcome (i.e., success or failure). Earlier studies have tried to resolve this issue by employing an intrinsic reward to induce interactions that are helpful for learning an effective policy. However, this approach requires extensive prior knowledge for designing an intrinsic reward. To train the MARL model effectively without designing an intrinsic reward, we propose a learning-based exploration strategy that generates the initial states of a game. The proposed method adopts a variational graph autoencoder to represent a game state such that (1) the state can be compactly encoded into a latent representation by considering relationships among agents, and (2) the latent representation can be used as an effective input for a coupled surrogate model that predicts an exploration score. The proposed method then finds new latent representations that maximize the exploration scores and decodes these representations to generate initial states from which the MARL model starts training in the game and thus experiences novel and rewardable states. We demonstrate that our method improves the training and performance of the MARL model more effectively than existing exploration methods.
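To make the pipeline described above concrete, the following is a minimal, illustrative sketch (not the authors' implementation) of the latent-space exploration loop: a graph encoder compresses a multi-agent state into a latent vector, a surrogate network predicts an exploration score from that vector, the latent vector is updated by gradient ascent on the predicted score, and a decoder maps the result back to a candidate initial state. All module names, network sizes, and the placeholder agent-interaction graph are assumptions made for the example.

```python
import torch
import torch.nn as nn

class GraphEncoder(nn.Module):
    """Toy stand-in for the variational graph autoencoder's encoder:
    one graph-convolution step, then heads for the latent mean and log-variance."""
    def __init__(self, feat_dim, latent_dim):
        super().__init__()
        self.lin = nn.Linear(feat_dim, 64)
        self.mu_head = nn.Linear(64, latent_dim)
        self.logvar_head = nn.Linear(64, latent_dim)

    def forward(self, adj, feats):
        h = torch.relu(adj @ self.lin(feats))  # aggregate neighboring agents' features
        h = h.mean(dim=0)                      # pool agents into one state embedding
        return self.mu_head(h), self.logvar_head(h)

class StateDecoder(nn.Module):
    """Maps a latent vector back to per-agent state features (illustrative)."""
    def __init__(self, latent_dim, n_agents, feat_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(),
                                 nn.Linear(64, n_agents * feat_dim))
        self.shape = (n_agents, feat_dim)

    def forward(self, z):
        return self.net(z).view(self.shape)

class Surrogate(nn.Module):
    """Predicts an exploration score from a latent state representation."""
    def __init__(self, latent_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(),
                                 nn.Linear(64, 1))

    def forward(self, z):
        return self.net(z)

def propose_initial_state(encoder, decoder, surrogate, adj, feats, steps=50, lr=0.1):
    """Encode a known state, ascend the surrogate's exploration score in latent
    space, and decode the optimized latent vector as a new initial state."""
    mu, _ = encoder(adj, feats)
    z = mu.detach().clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = -surrogate(z)  # maximize the predicted exploration score
        loss.backward()
        opt.step()
    return decoder(z.detach())

if __name__ == "__main__":
    n_agents, feat_dim, latent_dim = 4, 8, 16
    adj = torch.eye(n_agents)                # placeholder agent-interaction graph
    feats = torch.randn(n_agents, feat_dim)  # placeholder per-agent features
    enc = GraphEncoder(feat_dim, latent_dim)
    dec = StateDecoder(latent_dim, n_agents, feat_dim)
    sur = Surrogate(latent_dim)
    new_state = propose_initial_state(enc, dec, sur, adj, feats)
    print(new_state.shape)  # (n_agents, feat_dim) candidate initial state
```

In practice the encoder, decoder, and surrogate would be trained on states collected during MARL training before the latent search is run; the sketch only illustrates how the search itself could be wired together.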