Proper Value Equivalence

One of the main challenges in model-based reinforcement learning (RL) is to decide which aspects of the environment should be modeled. The value-equivalence (VE) principle proposes a simple answer to this question: a model should capture the aspects of the environment that are relevant for value-based planning. Technically, VE distinguishes models based on a set of policies and a set of functions: a model is said to be VE to the environment if the Bellman operators it induces for the policies yield the correct result when applied to the functions. As the number of policies and functions increases, the set of VE models shrinks, eventually collapsing to a single point corresponding to a perfect model. A fundamental question underlying the VE principle is thus how to select the smallest sets of policies and functions that are sufficient for planning. In this paper we take an important step towards answering this question. We start by generalizing the concept of VE to order-$k$ counterparts defined with respect to $k$ applications of the Bellman operator. This leads to a family of VE classes that increase in size as $k \rightarrow \infty$. In the limit, all functions become value functions, and we have a special instantiation of VE which we call proper VE, or simply PVE. Unlike VE, the PVE class may contain multiple models even in the limit when all value functions are used. Crucially, all these models are sufficient for planning, meaning that they will yield an optimal policy even though they may ignore many aspects of the environment. We construct a loss function for learning PVE models and argue that popular algorithms such as MuZero can be understood as minimizing an upper bound for this loss. We leverage this connection to propose a modification to MuZero and show that it can lead to improved performance in practice.
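The distinction between order-1 VE and PVE can be illustrated on a toy problem. The sketch below is not taken from the paper: it assumes a hypothetical two-state environment under a single fixed policy, so the policy's transition matrix and reward vector are given directly, and all names (`ENV_P`, `MODEL_P`, `bellman`, and so on) are made up for illustration. The model has the wrong dynamics, so its Bellman operator disagrees with the environment's on an arbitrary function, yet both operators share the same fixed point, which is the sense in which a model can match the value function while ignoring aspects of the environment.

```python
def bellman(P, r, gamma, v):
    """Apply the policy Bellman operator once:
    (T v)(s) = r(s) + gamma * sum_t P[s][t] * v[t]."""
    n = len(v)
    return [r[s] + gamma * sum(P[s][t] * v[t] for t in range(n)) for s in range(n)]


def bellman_k(P, r, gamma, v, k):
    """Apply the Bellman operator k times (the order-k operator)."""
    for _ in range(k):
        v = bellman(P, r, gamma, v)
    return v


def close(u, v, tol=1e-9):
    """Element-wise comparison of two vectors up to a tolerance."""
    return all(abs(a - b) <= tol for a, b in zip(u, v))


# Hypothetical toy setup (an assumption for illustration, not from the paper).
GAMMA = 0.5
ENV_P = [[0.0, 1.0], [1.0, 0.0]]    # environment: the two states swap
ENV_R = [1.0, 1.0]
MODEL_P = [[1.0, 0.0], [0.0, 1.0]]  # model: each state loops onto itself
MODEL_R = [1.0, 1.0]

# Order-1 check on an arbitrary function v: the two operators disagree.
v = [1.0, 0.0]
order1_match = close(
    bellman(ENV_P, ENV_R, GAMMA, v),
    bellman(MODEL_P, MODEL_R, GAMMA, v),
)

# Large-k check: iterating each operator approximates its fixed point,
# i.e. the value function of the policy under that dynamics.
env_value = bellman_k(ENV_P, ENV_R, GAMMA, [0.0, 0.0], 200)
model_value = bellman_k(MODEL_P, MODEL_R, GAMMA, [0.0, 0.0], 200)
values_match = close(env_value, model_value)
```

Here `order1_match` is `False` while `values_match` is `True`: both value functions equal 2 in every state (reward 1 per step, discount 0.5), even though the model's transitions differ from the environment's. Using many iterations in place of the limit is a simplification chosen to keep the sketch free of external dependencies.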

