We design a simple reinforcement learning agent that, given only a specification of agent state dynamics and a reward function, can operate with some degree of competence in any environment. The agent maintains only visitation counts and value estimates for each agent-state-action pair. The value function is updated incrementally in response to temporal differences and optimistic boosts that encourage exploration. The agent executes actions that are greedy with respect to this value function. We establish a regret bound demonstrating convergence to near-optimal per-period performance, where the time taken to achieve near-optimality is polynomial in the number of agent states and actions, as well as in the reward mixing time of the best policy within the reference policy class, which comprises the policies that depend on history only through agent state. Notably, there is no further dependence on the number of environment states or on mixing times associated with other policies or statistics of history. Our result sheds light on the potential benefits of (deep) representation learning, which has demonstrated the capability to extract compact and relevant features from high-dimensional interaction histories.
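The agent described above can be illustrated with a minimal sketch. It is not the exact algorithm analyzed in the paper: the class name `OptimisticQAgent`, the discount factor `gamma`, the `1/n` step size, the `bonus_scale / sqrt(n)` form of the optimistic boost, and the optimistic initial value are all assumptions chosen for illustration. Only the overall structure (counts and values per agent-state-action pair, temporal-difference updates with a boost, greedy action selection) follows the description.

```python
import math
from collections import defaultdict


class OptimisticQAgent:
    """Illustrative agent: counts and value estimates per (agent state, action)."""

    def __init__(self, actions, update_state, reward_fn,
                 gamma=0.9, bonus_scale=1.0, initial_value=None):
        self.actions = list(actions)
        # Agent state dynamics: (state, action, observation) -> next agent state.
        self.update_state = update_state
        # Reward function: (state, action, next_state) -> float.
        self.reward_fn = reward_fn
        self.gamma = gamma
        self.bonus_scale = bonus_scale
        # Assumption: rewards lie in [0, 1], so 1 / (1 - gamma) bounds any value.
        if initial_value is None:
            initial_value = 1.0 / (1.0 - gamma)
        self.counts = defaultdict(int)
        self.values = defaultdict(lambda: initial_value)

    def act(self, state):
        # Greedy with respect to the current value estimates
        # (ties go to the earliest action in self.actions).
        return max(self.actions, key=lambda a: self.values[(state, a)])

    def observe(self, state, action, observation):
        next_state = self.update_state(state, action, observation)
        reward = self.reward_fn(state, action, next_state)
        self.counts[(state, action)] += 1
        n = self.counts[(state, action)]
        step = 1.0 / n                          # assumed step-size schedule
        bonus = self.bonus_scale / math.sqrt(n)  # assumed optimistic boost
        best_next = max(self.values[(next_state, a)] for a in self.actions)
        target = reward + bonus + self.gamma * best_next
        # Incremental temporal-difference update toward the boosted target.
        self.values[(state, action)] += step * (target - self.values[(state, action)])
        return next_state


if __name__ == "__main__":
    # Hypothetical toy setting: the observation itself is the next agent state,
    # and action 1 always earns reward 1 while action 0 earns 0.
    agent = OptimisticQAgent(
        actions=[0, 1],
        update_state=lambda s, a, o: o,
        reward_fn=lambda s, a, s2: float(a),
    )
    state = 0
    for _ in range(50):
        action = agent.act(state)
        state = agent.observe(state, action, observation=0)
    print(dict(agent.counts))
```

The memory footprint depends only on the number of agent states and actions, never on the environment's own state space, which mirrors the point made in the abstract.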