We design a simple reinforcement learning agent that, given only a specification of suitable internal state dynamics and a reward function, can operate with some degree of competence in any environment. The agent maintains visitation counts and value estimates for each state-action pair, where the state refers to the agent's internal state. The value function is updated incrementally in response to temporal differences and optimistic boosts that encourage exploration. The agent executes actions that are greedy with respect to this value function. We establish a regret bound demonstrating convergence to near-optimal per-period performance, where the time taken to achieve near-optimality is polynomial in the number of internal states and actions, as well as the reward averaging time of the best policy within the reference policy class, which comprises those policies that depend on history only through the agent's internal state. Notably, there is no further dependence on the number of environment states, nor on mixing times associated with other policies or statistics of history. Our result sheds light on the potential benefits of (deep) representation learning, which has demonstrated the ability to extract compact and relevant features from high-dimensional interaction histories.
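To make the described update concrete, the following is a minimal Python sketch of an agent in this style: it keeps visitation counts and value estimates over internal (agent) states and actions, updates values incrementally from a temporal difference plus a count-based optimistic boost, and acts greedily with respect to the current estimates. The class name, step-size schedule, and the exact form of the boost are illustrative assumptions, not the paper's precise specification.

```python
import numpy as np


class OptimisticAgentStateLearner:
    """Sketch of an optimistic value-learning agent over internal (agent) states.

    Illustrative reconstruction of the kind of agent described in the abstract;
    the step size 1/n and the c/sqrt(n) bonus are assumed, not taken from the paper.
    """

    def __init__(self, num_agent_states: int, num_actions: int, optimism_coef: float = 1.0):
        self.Q = np.zeros((num_agent_states, num_actions))       # value estimates
        self.counts = np.zeros((num_agent_states, num_actions))  # visitation counts
        self.c = optimism_coef                                    # scales the exploration bonus

    def act(self, agent_state: int) -> int:
        # Greedy action with respect to the current value estimates.
        return int(np.argmax(self.Q[agent_state]))

    def update(self, s: int, a: int, reward: float, next_s: int) -> None:
        # Incremental update driven by a temporal difference plus an
        # optimistic boost that shrinks with the visitation count.
        self.counts[s, a] += 1
        n = self.counts[s, a]
        alpha = 1.0 / n                  # assumed step-size schedule
        bonus = self.c / np.sqrt(n)      # assumed optimistic boost
        td = reward + np.max(self.Q[next_s]) - self.Q[s, a]
        self.Q[s, a] += alpha * (td + bonus)
```

A driver loop would map each interaction history to an internal state via the specified state dynamics, call `act` to select an action, and call `update` with the observed reward and next internal state.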