One of the main challenges in model-based reinforcement learning (RL) is to decide which aspects of the environment should be modeled. The value-equivalence (VE) principle proposes a simple answer to this question: a model should capture the aspects of the environment that are relevant for value-based planning. Technically, VE distinguishes models based on a set of policies and a set of functions: a model is said to be VE to the environment if the Bellman operators it induces for the policies yield the correct result when applied to the functions. As the number of policies and functions increases, the set of VE models shrinks, eventually collapsing to a single point corresponding to a perfect model. A fundamental question underlying the VE principle is thus how to select the smallest sets of policies and functions that are sufficient for planning. In this paper we take an important step towards answering this question. We start by generalizing the concept of VE to order-$k$ counterparts defined with respect to $k$ applications of the Bellman operator. This leads to a family of VE classes that increase in size as $k \rightarrow \infty$. In the limit, all functions become value functions, and we have a special instantiation of VE which we call proper VE or simply PVE. Unlike VE, the PVE class may contain multiple models even in the limit, when all value functions are used. Crucially, all these models are sufficient for planning, meaning that they will yield an optimal policy despite the fact that they may ignore many aspects of the environment. We construct a loss function for learning PVE models and argue that popular algorithms such as MuZero can be understood as minimizing an upper bound for this loss. We leverage this connection to propose a modification to MuZero and show that it can lead to improved performance in practice.
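For concreteness, the conditions described above can be sketched as follows; the notation is assumed here, following the standard value-equivalence setup, and is not spelled out in this excerpt. Writing $\mathcal{T}_\pi$ for the Bellman operator induced by the environment for policy $\pi$, and $\tilde{\mathcal{T}}_\pi$ for the operator induced by a model, the model is VE with respect to a set of policies $\Pi$ and a set of functions $\mathcal{V}$ if
\[
\tilde{\mathcal{T}}_\pi v = \mathcal{T}_\pi v \quad \text{for all } \pi \in \Pi \text{ and } v \in \mathcal{V},
\]
and order-$k$ VE if the same holds after $k$ applications of the operators,
\[
\tilde{\mathcal{T}}_\pi^{k} v = \mathcal{T}_\pi^{k} v \quad \text{for all } \pi \in \Pi \text{ and } v \in \mathcal{V}.
\]
As $k \rightarrow \infty$, repeated application of the Bellman operator converges to the value function $v_\pi$ regardless of the starting function $v$, so in the limit the condition constrains the model only through value functions; this is the proper VE (PVE) condition, $\tilde{v}_\pi = v_\pi$ for all $\pi \in \Pi$.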