Learning in multi-agent environments is difficult due to the non-stationarity introduced by an opponent's or partner's changing behaviors. Instead of reactively adapting to the other agent's (opponent or partner) behavior, we propose an algorithm that proactively influences the other agent's strategy to stabilize, which restrains the non-stationarity caused by that agent. We learn a low-dimensional latent representation of the other agent's strategy, together with a model of how this latent strategy evolves in response to our robot's behavior. With this learned dynamics model, we define an unsupervised stability reward that trains our robot to deliberately influence the other agent to converge toward a single strategy. We demonstrate that stabilizing the other agent improves the efficiency of maximizing the task reward in a variety of simulated environments, including autonomous driving, emergent communication, and robotic manipulation. We show qualitative results on our website: https://sites.google.com/view/stable-marl/.
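The following is a minimal conceptual sketch, not the paper's implementation: it assumes a placeholder latent-dynamics model (`latent_dynamics`) and illustrates how an unsupervised stability reward could be defined as the negative change in the other agent's predicted latent strategy, to be combined with the task reward when training the robot's policy. All function names, dimensions, and weights are hypothetical.

```python
import numpy as np

def latent_dynamics(z_prev, robot_traj):
    """Predict the other agent's next latent strategy from its previous
    latent strategy and the robot's recent behavior.
    Stand-in linear model; in practice this would be learned from data."""
    W_z = np.eye(len(z_prev)) * 0.9                        # assumed learned weights
    W_tau = np.full((len(z_prev), len(robot_traj)), 0.05)  # assumed learned weights
    return W_z @ z_prev + W_tau @ robot_traj

def stability_reward(z_prev, z_next):
    """Unsupervised reward: penalize change in the predicted latent strategy,
    encouraging the robot to drive the other agent toward a fixed strategy."""
    return -np.linalg.norm(z_next - z_prev)

# Hypothetical usage: add the stability reward to the task reward during training.
z_prev = np.zeros(4)                           # previous latent strategy estimate
robot_traj = np.random.uniform(-1, 1, size=8)  # robot's recent action trajectory
z_next = latent_dynamics(z_prev, robot_traj)
r_stable = stability_reward(z_prev, z_next)
print(f"stability reward: {r_stable:.3f}")
```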