Deep reinforcement learning repeatedly succeeds in closed, well-defined domains such as games (Chess, Go, StarCraft). The next frontier is real-world scenarios, where setups are numerous and varied. For this, agents need to learn the underlying rules governing the environment, so as to robustly generalise to conditions that differ from those they were trained on. Model-based reinforcement learning algorithms, such as the highly successful MuZero, aim to accomplish this by learning a world model. However, leveraging a world model has not consistently shown greater generalisation capabilities compared to model-free alternatives. In this work, we propose improving the data efficiency and generalisation capabilities of MuZero by explicitly incorporating the symmetries of the environment in its world-model architecture. We prove that, so long as the neural networks used by MuZero are equivariant to a particular symmetry group acting on the environment, the entirety of MuZero's action-selection algorithm will also be equivariant to that group. We evaluate Equivariant MuZero on procedurally-generated MiniPacman and on Chaser from the ProcGen suite: training on a set of mazes, and then testing on unseen rotated versions, demonstrating the benefits of equivariance. Further, we verify that our performance improvements hold even when only some of the components of Equivariant MuZero obey strict equivariance, which highlights the robustness of our construction.
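To make the equivariance claim concrete, the following is a minimal sketch of the conditions involved, stated under assumed notation that is not taken from the text above: $h$, $g_{\text{dyn}}$, and $f$ denote MuZero's representation, dynamics, and prediction networks, and $\rho_O$, $\rho_S$, $\rho_A$ denote representations of a symmetry group $G$ acting on observations, latent states, and actions, with rewards and values treated as invariant scalars.

\begin{align*}
  h(\rho_O[g]\, o) &= \rho_S[g]\, h(o) && \text{(equivariant encoder)} \\
  g_{\text{dyn}}\big(\rho_S[g]\, s,\ \rho_A[g]\, a\big) &= \big(\rho_S[g]\, s',\ r\big), \quad \text{where } (s', r) = g_{\text{dyn}}(s, a) && \text{(equivariant dynamics)} \\
  f(\rho_S[g]\, s) &= \big(\rho_A[g]\, \pi,\ v\big), \quad \text{where } (\pi, v) = f(s) && \text{(equivariant prediction)}
\end{align*}

Under these three conditions, the stated result is that the full MCTS-based action selection commutes with the group action, i.e. $\pi_{\text{MCTS}}(\rho_O[g]\, o) = \rho_A[g]\, \pi_{\text{MCTS}}(o)$ for all $g \in G$.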