The field of General Reinforcement Learning (GRL) formulates the problem of sequential decision-making from the ground up. The history of interaction constitutes the "ground" state of the system, which never repeats. On the one hand, this generality allows GRL to model almost every possible domain, e.g.\ Bandits, MDPs, POMDPs, PSRs, and history-based environments. On the other hand, the near-optimal policies in GRL are in general functions of the complete history, which hinders not only learning but also planning in GRL. The usual workaround for the planning part is to give the agent a Markovian abstraction of the underlying process, so that it can use any MDP planning algorithm to find a near-optimal policy. The Extreme State Aggregation (ESA) framework has extended this idea to non-Markovian abstractions without compromising the possibility of planning through a (surrogate) MDP. A distinguishing feature of ESA is that it proves an upper bound of $O\left(\varepsilon^{-A} \cdot (1-\gamma)^{-2A}\right)$ on the number of states required for the surrogate MDP (where $A$ is the number of actions, $\gamma$ is the discount factor, and $\varepsilon$ is the optimality gap), which holds \emph{uniformly} for \emph{all} domains. While the possibility of a universal bound is quite remarkable, we show that this bound is very loose. We propose a novel non-MDP abstraction which allows for a much better upper bound of $O\left(\varepsilon^{-1} \cdot (1-\gamma)^{-2} \cdot A \cdot 2^{A}\right)$. Furthermore, we show that this bound can be improved further to $O\left(\varepsilon^{-1} \cdot (1-\gamma)^{-2} \cdot \log^3 A \right)$ by using an action-sequentialization method.
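As an illustrative (constant-free) comparison, not part of the formal results, suppose $A = 10$, $\varepsilon = 0.1$, and $\gamma = 0.9$. Then the ESA bound scales as
\[
\varepsilon^{-A} \cdot (1-\gamma)^{-2A} = 10^{10} \cdot 10^{20} = 10^{30},
\]
whereas the proposed non-MDP abstraction scales as
\[
\varepsilon^{-1} \cdot (1-\gamma)^{-2} \cdot A \cdot 2^{A} = 10 \cdot 10^{2} \cdot 10 \cdot 2^{10} \approx 10^{7},
\]
and the action-sequentialized variant as
\[
\varepsilon^{-1} \cdot (1-\gamma)^{-2} \cdot \log^{3} A \approx 10 \cdot 10^{2} \cdot 37 \approx 4 \times 10^{4}
\]
(taking $\log$ to base $2$), i.e.\ the improvement spans many orders of magnitude even for moderately sized action spaces.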