Many important real-world problems have action spaces that are high-dimensional, continuous or both, making full enumeration of all possible actions infeasible. Instead, only small subsets of actions can be sampled for the purpose of policy evaluation and improvement. In this paper, we propose a general framework to reason in a principled way about policy evaluation and improvement over such sampled action subsets. This sample-based policy iteration framework can in principle be applied to any reinforcement learning algorithm based upon policy iteration. Concretely, we propose Sampled MuZero, an extension of the MuZero algorithm that is able to learn in domains with arbitrarily complex action spaces by planning over sampled actions. We demonstrate this approach on the classical board game of Go and on two continuous control benchmark domains: DeepMind Control Suite and Real-World RL Suite.
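To make the core idea concrete, the following is a minimal, illustrative sketch (not the paper's Sampled MuZero algorithm) of policy improvement over a sampled action subset in a continuous action space. The names `q_estimate` and `improve_policy`, and the Gaussian policy parameterization, are hypothetical stand-ins for a learned model and policy.

```python
import numpy as np

# Minimal sketch of sample-based policy improvement (illustrative only).
# A Gaussian policy over a continuous action space is improved using only
# a small sampled subset of actions, since full enumeration is infeasible.

rng = np.random.default_rng(0)
action_dim, num_samples = 3, 16

def q_estimate(state, action):
    # Hypothetical action-value estimate; in practice this would come from
    # a learned model or from search (e.g. MCTS over the sampled actions).
    return -np.sum((action - state) ** 2)

def improve_policy(state, mean, std):
    # 1. Sample a small subset of actions instead of enumerating all actions.
    actions = rng.normal(mean, std, size=(num_samples, action_dim))
    # 2. Evaluate only the sampled actions.
    values = np.array([q_estimate(state, a) for a in actions])
    # 3. Improve the policy toward higher-value sampled actions
    #    (softmax-weighted regression onto the samples).
    weights = np.exp(values - values.max())
    weights /= weights.sum()
    return weights @ actions  # new policy mean

state = np.zeros(action_dim)
mean, std = np.ones(action_dim), 0.5 * np.ones(action_dim)
print(improve_policy(state, mean, std))
```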