Deciding What to Model: Value-Equivalent Sampling for Reinforcement Learning

The quintessential model-based reinforcement-learning agent iteratively refines its estimates or prior beliefs about the true underlying model of the environment. Recent empirical successes in model-based reinforcement learning with function approximation, however, eschew the true model in favor of a surrogate that, while ignoring various facets of the environment, still facilitates effective planning over behaviors. Recently formalized as the value equivalence principle, this algorithmic technique is perhaps unavoidable as real-world reinforcement learning demands consideration of a simple, computationally-bounded agent interacting with an overwhelmingly complex environment, whose underlying dynamics likely exceed the agent's capacity for representation. In this work, we consider the scenario where agent limitations may entirely preclude identifying an exactly value-equivalent model, immediately giving rise to a trade-off between identifying a model that is simple enough to learn while only incurring bounded sub-optimality. To address this problem, we introduce an algorithm that, using rate-distortion theory, iteratively computes an approximately-value-equivalent, lossy compression of the environment which an agent may feasibly target in lieu of the true model. We prove an information-theoretic, Bayesian regret bound for our algorithm that holds for any finite-horizon, episodic sequential decision-making problem. Crucially, our regret bound can be expressed in one of two possible forms, providing a performance guarantee for finding either the simplest model that achieves a desired sub-optimality gap or, alternatively, the best model given a limit on agent capacity.
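
For orientation, the two forms of the guarantee mirror the two classical objects of rate-distortion theory. What follows is a generic sketch rather than the paper's exact formulation: the symbols $M$ (the true environment model), $\tilde{M}$ (the compressed surrogate), $d$ (a distortion function measuring the loss of value equivalence), $D$ (the tolerated distortion), and $R$ (the limit on agent capacity) are notation assumed here purely for illustration.

```latex
% Rate-distortion function: the least information about M that a surrogate
% must retain while keeping expected distortion within the threshold D.
\mathcal{R}(D) \;=\; \inf_{p(\tilde{M} \mid M)\,:\;\mathbb{E}[d(M,\tilde{M})] \le D} \mathbb{I}(M;\tilde{M})

% Distortion-rate function: the least expected distortion achievable by a
% surrogate that retains at most R units of information about M.
\mathcal{D}(R) \;=\; \inf_{p(\tilde{M} \mid M)\,:\;\mathbb{I}(M;\tilde{M}) \le R} \mathbb{E}[d(M,\tilde{M})]
```

Read this way, the first form corresponds to fixing the sub-optimality gap and minimizing the information the agent must acquire, and the second to fixing the agent's capacity and minimizing the resulting distortion.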
