Many important real-world problems have action spaces that are high-dimensional, continuous or both, making full enumeration of all possible actions infeasible. Instead, only small subsets of actions can be sampled for the purpose of policy evaluation and improvement. In this paper, we propose a general framework to reason in a principled way about policy evaluation and improvement over such sampled action subsets. This sample-based policy iteration framework can in principle be applied to any reinforcement learning algorithm based upon policy iteration. Concretely, we propose Sampled MuZero, an extension of the MuZero algorithm that is able to learn in domains with arbitrarily complex action spaces by planning over sampled actions. We demonstrate this approach on the classical board game of Go and on two continuous control benchmark domains: DeepMind Control Suite and Real-World RL Suite.
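To make the core idea concrete, the following is a minimal, illustrative sketch (not the paper's Sampled MuZero algorithm) of policy improvement over a sampled action subset in a continuous action space. The names `q_estimate` and `improve_policy`, and the Gaussian policy parameterization, are hypothetical stand-ins for a learned model and policy.

```python
import numpy as np

# Minimal sketch of sample-based policy improvement (illustrative only).
# A Gaussian policy over a continuous action space is improved using only
# a small sampled subset of actions, since full enumeration is infeasible.

rng = np.random.default_rng(0)
action_dim, num_samples = 3, 16

def q_estimate(state, action):
    # Hypothetical action-value estimate; in practice this would come from
    # a learned model or from search (e.g. MCTS over the sampled actions).
    return -np.sum((action - state) ** 2)

def improve_policy(state, mean, std):
    # 1. Sample a small subset of actions instead of enumerating all actions.
    actions = rng.normal(mean, std, size=(num_samples, action_dim))
    # 2. Evaluate only the sampled actions.
    values = np.array([q_estimate(state, a) for a in actions])
    # 3. Improve the policy toward higher-value sampled actions
    #    (softmax-weighted regression onto the samples).
    weights = np.exp(values - values.max())
    weights /= weights.sum()
    return weights @ actions  # new policy mean

state = np.zeros(action_dim)
mean, std = np.ones(action_dim), 0.5 * np.ones(action_dim)
print(improve_policy(state, mean, std))
```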