The quintessential model-based reinforcement-learning agent iteratively refines its estimates or prior beliefs about the true underlying model of the environment. Recent empirical successes in model-based reinforcement learning with function approximation, however, eschew the true model in favor of a surrogate that, while ignoring various facets of the environment, still facilitates effective planning over behaviors. Recently formalized as the value equivalence principle, this algorithmic technique is perhaps unavoidable as real-world reinforcement learning demands consideration of a simple, computationally-bounded agent interacting with an overwhelmingly complex environment. In this work, we entertain an extreme scenario wherein some combination of immense environment complexity and limited agent capacity entirely precludes identifying an exactly value-equivalent model. In light of this, we embrace a notion of approximate value equivalence and introduce an algorithm for incrementally synthesizing simple and useful approximations of the environment from which an agent might still recover near-optimal behavior. Crucially, we recognize the information-theoretic nature of this lossy environment compression problem and use the appropriate tools of rate-distortion theory to make mathematically precise how value equivalence can lend tractability to otherwise intractable sequential decision-making problems.
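To make the lossy-compression framing concrete, the following display sketches one standard way of casting value-equivalent model selection as a rate-distortion problem. The symbols here are illustrative placeholders rather than the paper's exact definitions: \(M^\star\) denotes the (random) true environment model, \(\hat{M}\) its compressed surrogate, \(\Pi\) a designer-chosen policy class, and \(d\) a distortion measuring value disagreement.
\[
\mathcal{R}(D) \;=\; \inf_{p(\hat{M} \mid M^\star)} \; \mathbb{I}\big(M^\star; \hat{M}\big) \quad \text{subject to} \quad \mathbb{E}\big[d(M^\star, \hat{M})\big] \le D,
\]
where one natural, value-equivalence-inspired choice of distortion is
\[
d(M^\star, \hat{M}) \;=\; \sup_{\pi \in \Pi} \big\| V^{\pi}_{M^\star} - V^{\pi}_{\hat{M}} \big\|_\infty,
\]
with \(V^{\pi}_{M}\) the value function of policy \(\pi\) under model \(M\). Under this reading, the tolerance \(D = 0\) corresponds to exact value equivalence with respect to \(\Pi\), while \(D > 0\) admits simpler, lossy surrogates that still support near-optimal planning.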