Reinforcement Learning (RL) methods are typically applied directly in environments to learn policies. In complex environments with continuous state-action spaces, sparse rewards, and/or long temporal horizons, learning a good policy in the original environment can be difficult. Focusing on the offline RL setting, we aim to build a simple and discrete world model that abstracts the original environment. RL methods are then applied to our world model instead of the environment data, which simplifies policy learning. Our world model, dubbed Value Memory Graph (VMG), is designed as a directed-graph-based Markov decision process (MDP) whose vertices and directed edges represent graph states and graph actions, respectively. Because the state-action space of VMG is finite and relatively small compared to the original environment, we can directly apply the value iteration algorithm on VMG to estimate graph state values and identify the best graph actions. VMG is trained from and built on the offline RL dataset. Together with an action translator that converts the abstract graph actions in VMG to real actions in the original environment, VMG controls agents to maximize episode returns. Our experiments on the D4RL benchmark show that VMG can outperform state-of-the-art offline RL methods on several tasks, especially when environments have sparse rewards and long temporal horizons. Code will be made publicly available.
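To make the graph-MDP idea concrete, the following is a minimal sketch of value iteration over a small directed graph, assuming deterministic edge transitions and per-edge rewards. This is an illustrative toy, not the paper's VMG implementation; all names (edges, rewards, gamma, etc.) are hypothetical.

```python
# Minimal sketch: value iteration on a directed-graph MDP where vertices are
# graph states and directed edges are graph actions (illustrative only).

def value_iteration(edges, rewards, gamma=0.95, tol=1e-6, max_iters=1000):
    """edges: dict mapping each vertex to a list of successor vertices.
    rewards: dict mapping (vertex, successor) pairs to scalar rewards.
    Returns estimated vertex values and a greedy graph-action policy."""
    # Collect every vertex, including ones that only appear as successors.
    vertices = set(edges) | {s for succs in edges.values() for s in succs}
    values = {v: 0.0 for v in vertices}

    for _ in range(max_iters):
        delta = 0.0
        for v, succs in edges.items():
            if not succs:  # vertex with no outgoing edge: value stays fixed
                continue
            # Bellman backup over the finite set of outgoing edges.
            best = max(rewards[(v, s)] + gamma * values[s] for s in succs)
            delta = max(delta, abs(best - values[v]))
            values[v] = best
        if delta < tol:
            break

    # Greedy graph action: follow the edge with the highest backed-up value.
    policy = {
        v: max(succs, key=lambda s: rewards[(v, s)] + gamma * values[s])
        for v, succs in edges.items() if succs
    }
    return values, policy


if __name__ == "__main__":
    # Tiny example graph: A -> B -> C, with a shortcut A -> C.
    edges = {"A": ["B", "C"], "B": ["C"], "C": []}
    rewards = {("A", "B"): 0.0, ("A", "C"): 0.5, ("B", "C"): 1.0}
    values, policy = value_iteration(edges, rewards)
    print(values, policy)
```

In VMG, an analogous planning step on the graph would be paired with the action translator described above, which maps the selected abstract graph action back to a real action in the original environment.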