Policy gradient methods are often applied to reinforcement learning in continuous multiagent games. These methods perform local search in the joint-action space, and as we show, they are susceptible to a game-theoretic pathology known as relative overgeneralization. To resolve this issue, we propose Multiagent Soft Q-learning, which can be seen as the analogue of applying Q-learning to continuous controls. We compare our method to MADDPG, a state-of-the-art approach, and show that our method achieves better coordination in multiagent cooperative tasks, converging to better local optima in the joint-action space.